Machine learning based activity detection utilizing reconstructed 3D arm postures

ABSTRACT

A method comprises obtaining data from an inertial measurement unit of a user, generating three-dimensional (3D) arm pose estimates from the obtained data, applying the generated 3D arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities, and obtaining at least one classification output from the machine learning system. The machine learning system illustratively comprises at least one support vector machine (SVM) model. Applying the generated 3D arm pose estimates to the machine learning system illustratively comprises extracting possible intake gestures of the generated 3D arm pose estimates into respective segments, resampling each of at least a subset of the extracted segments, and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture.

RELATED APPLICATIONS

The present application is a continuation-in-part of PCT International Application No. PCT/US21/29189, filed Apr. 26, 2021 and entitled “Deep Continuous 3D Hand Pose Tracking,” which is incorporated by reference herein in its entirety, and which claims priority to and fully incorporates by reference U.S. Provisional Patent Application Ser. No. 63/015,381, filed Apr. 24, 2020 and entitled “FingerTrak: Deep Continuous 3D Hand Pose Tracking,” also incorporated by reference herein in its entirety.

FIELD

The field relates generally to information processing systems, and more particularly to machine learning based techniques implemented in such systems.

BACKGROUND

Activity detection is of particular importance in a wide variety of different fields. For example, one of the main factors that plays a role in obesity is unhealthy eating behavior. In other words, if people intake more (healthy or unhealthy) food (calculated in calories) than their body requires, they will gain weight. If they do nothing, or not enough, to combat this (such as exercise), it will lead to them becoming overweight and eventually obese. To help understand and potentially reduce issues caused by unhealthy eating, conventional practice typically requires precise recording of all eating-related events throughout the day (food journaling). The traditional approach for food journaling usually requires the user to manually log the eating activities (e.g., on a smartphone app). This traditional food journaling approach heavily depends on the user's self-motivation and determination, and is known to be inaccurate, inefficient and unsustainable in practice. Accordingly, a need exists for improved techniques for detecting eating behaviors and other activities.

SUMMARY

Illustrative embodiments provide machine learning based activity detection utilizing reconstructed three-dimensional (3D) arm postures. Arm postures are also referred to herein as “arm poses,” and estimates thereof are obtained in some embodiments utilizing data from an inertial measurement unit (IMU) in a smartwatch or other wearable device.

Although suitable for use in detecting a broad array of different types of activities, the disclosed techniques in some embodiments are illustratively configured for early detection of potentially harmful eating behaviors and can facilitate timely intervention, behavior modification and/or treatment. For example, in some embodiments, a machine learning based activity detection system referred to herein as “EatingTrak” is disclosed. This example system is implemented at least in part as a wearable technology using one or more wearable devices, such as a commodity smartwatch, and is configured to detect fine-grained eating and drinking activities at a per-intake level in a variety of operating environments, including free-living scenarios.

EatingTrak in some embodiments illustratively reconstructs 3D arm postures using data obtained from an IMU in the smartwatch, and then recognizes fine-grained eating activities from the series of estimated 3D arm postures. Multiple IMUs and/or other types of data sources can be used in other embodiments. The example system in some embodiments first estimates 3D arm posture from the IMU data. Then it uses a machine learning system to detect fine-grained eating activities from the estimated arm postures. Various types of intervention, behavior modification and/or treatment are automatically driven by outputs of the machine learning system, in an accurate and efficient manner, advantageously leading to significantly improved user outcomes relative to conventional approaches.

In one embodiment, a method illustratively comprises obtaining data from an IMU of a user, generating 3D arm pose estimates from the obtained data, applying the generated 3D arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities, and obtaining at least one classification output from the machine learning system.

In some embodiments, generating 3D arm pose estimates from the obtained data comprises, for each of a plurality of body directions, calculating wrist orientation relative to a torso coordinate system using wrist orientation relative to an earth coordinate system and the body direction, looking up the calculated wrist orientation relative to the torso coordinate system in a weighted dictionary generated using actual arm poses to determine a weighted point cloud, and utilizing the weighted point cloud to assign a probability to the body direction based at least in part on weights of the weighted point cloud.

Generating 3D arm pose estimates may additionally comprise selecting a particular one of the body directions based at least in part on their respective assigned probabilities, utilizing the selected body direction to transform the wrist orientation relative to the earth coordinate system to determine a derived wrist orientation relative to the torso coordinate system, looking up the derived wrist orientation relative to the torso coordinate system in the weighted dictionary to determine a weighted point cloud, and determining a given one of the 3D arm pose estimates based at least in part on weights of the weighted point cloud.
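By way of non-limiting illustration, the body direction selection and weighted dictionary lookup described above can be sketched as follows. The sketch rests on several simplifying assumptions that are not taken from the disclosure: the wrist orientation is represented as a 3D direction vector, the body direction is a single yaw angle about the vertical axis, the dictionary key is a coarsely quantized vector, and the final pose is a weighted average of the point cloud. The names to_torso_frame, quantize and estimate_pose are hypothetical.

```python
import math

def to_torso_frame(wrist_earth, body_yaw_deg):
    """Rotate an earth-frame direction vector into the torso frame.
    Assumption: the body direction is a pure yaw about the vertical (z) axis."""
    a = math.radians(-body_yaw_deg)
    x, y, z = wrist_earth
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)

def quantize(vector, step=0.5):
    """Hypothetical dictionary key: the vector snapped to a coarse integer grid."""
    return tuple(round(c / step) for c in vector)

def estimate_pose(wrist_earth, candidate_yaws, dictionary):
    """Score each candidate body direction by the total weight of the point
    cloud it retrieves, select the most probable direction, and return a
    weighted-average pose for that direction."""
    scores = {}
    for yaw in candidate_yaws:
        key = quantize(to_torso_frame(wrist_earth, yaw))
        cloud = dictionary.get(key, [])
        scores[yaw] = sum(weight for _, weight in cloud)
    total = sum(scores.values())
    if total == 0:
        return None, None, {}  # no candidate direction matched the dictionary
    probs = {yaw: score / total for yaw, score in scores.items()}
    best_yaw = max(probs, key=probs.get)
    cloud = dictionary[quantize(to_torso_frame(wrist_earth, best_yaw))]
    weight_sum = sum(weight for _, weight in cloud)
    dims = len(cloud[0][0])
    pose = tuple(sum(point[i] * weight for point, weight in cloud) / weight_sum
                 for i in range(dims))
    return best_yaw, pose, probs

if __name__ == "__main__":
    # Toy dictionary: key -> list of (arm pose point, weight) pairs.
    toy_dictionary = {
        (2, 0, 0): [((0.3, 0.0, -0.1), 3.0), ((0.5, 0.0, -0.1), 1.0)],
        (0, 2, 0): [((0.0, 0.3, 0.0), 1.0)],
    }
    print(estimate_pose((0.0, 1.0, 0.0), [0, 90], toy_dictionary))
```

In this toy example, the candidate body direction whose point cloud carries the larger total weight receives the higher probability, and the arm pose estimate is drawn from that candidate's point cloud.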

In some embodiments, the machine learning system illustratively comprises at least one support vector machine (SVM) model, although a wide variety of additional or alternative machine learning models can be used, including machine learning systems based on convolutional neural networks, deep neural networks, and other types of neural networks.

Applying the generated 3D arm pose estimates to the machine learning system illustratively comprises extracting possible intake gestures of the generated 3D arm pose estimates into respective segments, resampling each of at least a subset of the extracted segments, and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture.
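As a non-limiting illustration of the resampling step, the sketch below linearly resamples a variable-length segment of pose estimates to a fixed number of samples and flattens it into a fixed-length feature vector of the kind a classifier such as an SVM could accept. Linear interpolation, the target length, and the names resample and to_feature_vector are assumptions made for illustration only; the classifier itself is not shown.

```python
def resample(segment, n_out):
    """Linearly resample a list of equal-length pose tuples to n_out samples."""
    if n_out < 2 or len(segment) < 2:
        raise ValueError("need at least two input and two output samples")
    last = len(segment) - 1
    out = []
    for k in range(n_out):
        t = k * last / (n_out - 1)   # fractional index into the segment
        i = min(int(t), last - 1)    # left neighbour, clamped at the end
        frac = t - i
        a, b = segment[i], segment[i + 1]
        out.append(tuple(x + frac * (y - x) for x, y in zip(a, b)))
    return out

def to_feature_vector(segment, n_out=4):
    """Flatten a resampled segment into one fixed-length feature vector."""
    return [c for pose in resample(segment, n_out) for c in pose]

if __name__ == "__main__":
    # A three-sample segment of 2D values, stretched to five samples.
    print(resample([(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)], 5))
```

Resampling in this manner gives every candidate segment the same dimensionality regardless of how long the underlying gesture lasted.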

Some embodiments are configured to detect and remediate harmful eating behaviors, and/or to detect and encourage healthy eating behaviors, although the disclosed techniques can additionally or alternatively be used to provide activity detection and intervention for a wide variety of other behavior-related conditions.

It is to be appreciated that the foregoing arrangements are only examples, and numerous alternative arrangements are possible.

These and other illustrative embodiments include but are not limited to systems, methods, apparatus, processing devices, integrated circuits, and computer program products comprising processor-readable storage media having software program code embodied therein.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates an example human to computer interface that includes one or more cameras worn on a wrist of a user, in accordance with some embodiments;

FIG. 2 illustrates an example human to computer interface including one or more cameras mounted on a band, in accordance with some embodiments;

FIG. 3 illustrates a sample machine learning algorithm for estimating a hand pose, in accordance with some embodiments;

FIGS. 4A, 4B, 4C, and 4D illustrate steps of capturing images from a wrist-worn imaging sensor, including capturing ground truth, images from the wrist-worn imaging sensor, and a prediction of an actual hand pose, in accordance with some embodiments;

FIGS. 5A and 5B illustrate a kinematic model of a hand and the degrees of freedom of a finger joint, in accordance with some embodiments;

FIG. 6 illustrates continuous hand pose estimation results based on wrist-mounted image capture, in accordance with some embodiments;

FIG. 7A illustrates sample images captured from a wrist-mounted imaging sensor, in accordance with some embodiments;

FIG. 7B illustrates a sample predicted hand pose based on the images of FIG. 7A, in accordance with some embodiments;

FIG. 8 illustrates an example embodiment of a wrist-mounted system for continuously estimating a hand pose of a user, in accordance with some embodiments;

FIG. 9 illustrates an example embodiment of a wrist-mounted system for continuously estimating a hand pose of a user, in accordance with some embodiments;

FIG. 10 illustrates a process diagram for determining a hand pose, in accordance with some embodiments;

FIG. 11 illustrates an information processing system configured with functionality for machine learning based activity detection utilizing reconstructed 3D arm postures, in accordance with some embodiments;

FIG. 12 illustrates an example coordinate system with five degrees of freedom, in accordance with some embodiments;

FIG. 13 illustrates body direction estimation during daily activities, in accordance with some embodiments;

FIG. 14 illustrates intake segmentation, in accordance with some embodiments; and

FIG. 15 illustrates an information processing system comprising a processing platform implementing a deep learning system for activity detection, in accordance with some embodiments.

DETAILED DESCRIPTION

Illustrative embodiments can be implemented, for example, in the form of information processing systems comprising one or more processing platforms each having at least one computer, server or other processing device. A number of examples of such systems will be described in detail herein. It should be understood, however, that embodiments of the invention are more generally applicable to a wide variety of other types of information processing systems and associated computers, servers or other processing devices or other components. Accordingly, the term “information processing system” as used herein is intended to be broadly construed so as to encompass these and other arrangements.

Some embodiments disclosed herein relate to deep continuous 3D hand pose tracking as well as other types of activity tracking.

Hand gesture research has been an ongoing topic of study for many years. For example, studies involving Human Computer Interaction (HCI) have aimed to be able to recognize hand gestures through a variety of techniques. Many traditional hand pose estimation and gesture recognition technologies rely on computer vision using cameras in the environment to capture the hand pose. However, these systems do not work well if a hand moves outside the field of view of the camera.

Other techniques for estimating hand poses have been presented based upon wearable technology, such as wearing gloves or position sensors or markers on a hand. These devices are obtrusive, and in most cases, are only able to recognize a limited number of very discrete and pre-programmed hand poses.

It would be advantageous for a system to rely on non-obtrusive hardware that enables the system to continuously capture and recognize hand poses and gestures, including fine-grained finger gestures, while allowing a user to move away from a fixed position.

According to some embodiments, a system includes a wearable band configured to be coupled to an arm of a user; a first imaging sensor disposed on the wearable band, the first imaging sensor aimed to have a first field of view that is anatomically distal when the wearable band is coupled to the arm of the user; and wherein the first imaging sensor defines an optical axis, and wherein the optical axis is spaced a distance from the wearable band, the distance being less than 5 mm. Alternatively, the distance may be less than 4 mm, less than 6 mm, less than 7 mm, less than 8 mm, or less than 10 mm.

The system may further include a second imaging sensor disposed on the wearable band spaced from the first imaging sensor, the second imaging sensor aimed to have a second field of view that is anatomically distal when the wearable band is coupled to the arm of the user; and a third imaging sensor disposed on the wearable band spaced from the first imaging sensor and the second imaging sensor, the third imaging sensor aimed to have a third field of view that is anatomically distal when the wearable band is coupled to the arm of the user. A fourth imaging sensor may similarly be disposed on the wearable band.

The first, second, and third imaging sensors may be aimed to include a hand of the user within the first, second, and third fields of view. The first, second, and third imaging sensors may optionally be aimed to have a converging field of view.

In some cases, the first, second, and third imaging sensors are substantially equally spaced about the wearable band. Optionally, a fourth imaging sensor may be integrated into the system, and the four imaging sensors may be equally spaced about the wearable band.

A computing device may be in communication with the first, second, and third imaging sensors and configured to receive image data from the imaging sensors corresponding to finger positions of the user. The communication may be wired or wireless.

In some examples, a stitching algorithm is executable by the computing device to stitch image data captured at a correlated time to create stitched images. In some examples, a 3D prediction module implementing a 3D prediction model executable by the computing device is configured to analyze the stitched images and determine a position of one or more fingertips of the user.
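As a non-limiting illustration, stitching image data captured at a correlated time can be sketched as grouping frames by timestamp and then joining the grouped frames side by side. The record layout (timestamp, camera identifier, image), the timestamp tolerance, the left-to-right layout, and the function names are assumptions made for illustration only and are not the stitching algorithm of any particular embodiment.

```python
def group_by_time(frames, tolerance):
    """Group (timestamp, camera_id, image) records whose timestamps lie
    within `tolerance` seconds of the first frame in the group."""
    groups = []
    for frame in sorted(frames, key=lambda f: f[0]):
        if groups and frame[0] - groups[-1][0][0] <= tolerance:
            groups[-1].append(frame)
        else:
            groups.append([frame])
    return groups

def stitch_side_by_side(images):
    """Concatenate equal-height images (lists of pixel rows) left to right."""
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must share the same height")
    return [sum((img[r] for img in images), []) for r in range(height)]

def stitch_frames(frames, tolerance=0.005):
    """Return one stitched image per time group, ordered by camera id."""
    stitched = []
    for group in group_by_time(frames, tolerance):
        ordered = sorted(group, key=lambda f: f[1])
        stitched.append(stitch_side_by_side([f[2] for f in ordered]))
    return stitched

if __name__ == "__main__":
    frames = [
        (0.000, 0, [[1, 2], [3, 4]]),
        (0.001, 1, [[5, 6], [7, 8]]),
        (0.033, 0, [[9, 9], [9, 9]]),
        (0.034, 1, [[0, 0], [0, 0]]),
    ]
    for image in stitch_frames(frames):
        print(image)
```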

In some examples, a kinematic module implementing a kinematic model executable by the computing device is implemented to determine, based at least in part on the position of one or more fingertips, a position and orientation of hand joint angles. The hand joint angles may include one or more of a metacarpophalangeal joint angle, a proximal interphalangeal joint angle, a distal interphalangeal joint angle, and a radiocarpal joint angle.

According to some embodiments, a method includes the steps of receiving a first plurality of images from one or more imaging sensors located on an arm of a user, the first plurality of images captured at a same time and depicting a hand of the user, wherein the one or more imaging sensors each define an optical axis and wherein each optical axis is spaced from the arm of the user less than 8 mm; determining, based on the first plurality of images, a position of one or more fingertips of the user; and determining, based on the position of the one or more fingertips of the user, an estimation of a pose of the hand of the user.

The step of determining a position of one or more fingertips may be performed by a machine learning network (e.g., a convolutional neural network). Optionally, the step of determining an estimation of a pose of the hand of the user may be performed by inference with a skeletal and kinematic model.

In some cases, the pose of the hand of the user includes one or more of a metacarpophalangeal joint angle, a proximal interphalangeal joint angle, a distal interphalangeal joint angle, and a radiocarpal joint angle.

The method may further include the steps of determining continuous hand tracking of the user by: receiving a second plurality of images from the plurality of imaging sensors, the second plurality of images captured at a second time; and determining an estimated pose of the hand of the user at the second time.

According to some embodiments, a method for tracking the position of a hand includes receiving first images from one or more imaging sensors mounted to a wrist of a user; determining, based on the images, a 3D spatial position of one or more fingertips of the user; determining, based on the 3D spatial position of the one or more fingertips, a pose of the hand; receiving second images from the one or more imaging sensors mounted to the wrist of the user; and determining, based on one or more second images, a second pose of the hand.

The step of determining the 3D spatial position of one or more fingertips may be performed with a machine learning algorithm (e.g., a deep neural network). In some examples, the method further includes stitching together the images. The step of determining the pose of the hand may be performed, at least in part, on a kinematic model that infers the pose of the hand based, at least in part, on the 3D spatial position of the one or more fingertips.

Optionally, the method includes the step of determining, based at least in part on the second pose of the hand, a gesture.

Some embodiments herein provide a non-obtrusive system for continuously estimating hand pose, gestures, and finger positions. The system may include one or more cameras that may continually capture images of a hand of a user. The system may be wrist-mounted and capture images from a location that is anatomically proximal of a user's hand. In some cases, the system uses one, two, three, four, six, eight, nine, ten, or more cameras located near a wrist of a user. In some embodiments, a single camera may be used to capture images of a hand of a user and infer hand poses and finger joint positions. In some cases, a single camera is located on the back of the wrist and captures images of the back of a user's hand, and from the images showing the back of the user's hand, the system can infer hand poses and finger positions with a high degree of accuracy. In some cases, the cameras are located close to the skin of a wearer, or close to the mount that attaches the system to a wrist of a user. In some cases, an optical axis of an imaging sensor is less than about 10 mm (mm=millimeter(s)) from the wrist of a user, or less than about 8 mm, or less than about 6 mm or less than about 4 mm, or less than about 2 mm from the skin of a user. In some cases, the optical axis of an imaging sensor is less than about 2 mm, or less than about 4 mm, or less than about 5 mm, or less than about 6 mm, or less than about 8 mm, or less than about 10 mm from a band that attaches the imaging sensor to a user.

According to some embodiments, a minimally obtrusive wristband includes one or more imaging sensors that allow for continuous three-dimensional finger tracking and micro finger pose recognition. In some cases, one or more imaging sensors may be used and may be disposed about a wrist mount.

FIG. 1 illustrates an embodiment of an example human to computer interface device 100 that includes one or more cameras 102 worn on a wrist of a user. While the illustrated embodiment shows multiple cameras 102, some embodiments may include one, two, three, four, five, six, seven, eight, ten or more cameras located about a user's arm or wrist. In some cases, the one or more cameras 102 are located anatomically proximal of the wearer's hand and are aimed to capture images of a hand of the user. The cameras 102 may be coupled to a mount 104 for securing the device to a user. In some cases, the mount 104 may comprise a wristband 104. The interface device 100 is referred to in some embodiments as system 100. The cameras 102 are referred to in some embodiments as imaging sensors 102.

The wristband 104 may be any suitable wristband, and may include, without limitation, an elastic or non-elastic band, and may be selectively coupled to a wearer by a releasable mechanism such as a buckle, a fastener, a snap, a clasp, a latch, hook and loop fastener, an elastomeric material, a living hinge, a spring, a biasing mechanism, or some other suitable mechanism for selectively coupling the mount to a user.

With all of the embodiments described herein, the imaging sensors 102 may be any suitable sensors for capturing images. For example, the imaging sensors 102 may include, without limitation, infrared (IR) imaging sensors, thermal imaging sensors, charge-coupled devices (CCDs), complementary metal oxide semiconductor (CMOS) devices, active-pixel sensors, radar imaging sensors, fiber optics, and other known imaging device technology. For ease in describing the various embodiments, the term “camera” will be used and is used as a broad term to include any type of sensor that can capture images whether in the visible spectrum or otherwise.

In some cases, the cameras 102 form a sensor array and the captured images (e.g., image data) from a plurality of cameras may be combined, such as by stitching, to form a stitched image that includes images (e.g., image data) from more than one camera 102.

In some cases, a sensor array may be mounted remotely from a wristband, and one or more optical fibers coupled to the wristband 104 may transmit images to the sensor array for processing, analysis, manipulation, or stitching.

In some cases, the one or more cameras 102 are low profile, meaning that they are relatively close to the mount, such as a wristband 104, to which they are mounted. The images may be acquired at a selected image frame resolution and/or an appropriate frame rate, and the resolution may comprise resolution of the one or more cameras 102 mounted on the device 100. The image frame resolution may be defined by the number of pixels in a frame. The image resolution of the one or more cameras may comprise any of the following resolutions, without limitation: 32×24 pixels; 32×48 pixels; 48×64 pixels; 160×120 pixels, 249×250 pixels, 250×250 pixels, 320×240 pixels, 420×352 pixels, 480×320 pixels, 640×480 pixels, 720×480 pixels, 1280×720 pixels, 1440×1080 pixels, 1920×1080 pixels, 2048×1080 pixels, 3840×2160 pixels, 4096×2160 pixels, 7680×4320 pixels, or 15360×8640 pixels. The resolution of the cameras may comprise a resolution within a range defined by any two of the preceding pixel resolutions, for example within a range from 32×24 pixels to 250×250 pixels (e.g., 249×250 pixels). In some embodiments, the system comprises more than one imaging sensor (e.g. camera, etc.), such as at least 2, 3, 4, 5, 6, 7, 8, 9, or 10 imaging sensors. In some embodiments, plural imaging sensors can yield high accuracy even while using low resolution imaging sensors (e.g., an imaging resolution lower than the imaging resolution of the system with only one imaging sensor). In some embodiments, at least one dimension (the height and/or the width) of the image resolution of the imaging sensors can be no more than any of the following, including but not limited to 8 pixels, 16 pixels, 24 pixels, 32 pixels, 48 pixels, 72 pixels, 96 pixels, 108 pixels, 128 pixels, 256 pixels, 360 pixels, 480 pixels, 720 pixels, 1080 pixels, 1280 pixels, 1536 pixels, or 2048 pixels. 
In some embodiments, the system is configured to accurately identify at least 10, or at least 12, or at least 15, or at least 20 different hand poses per user, with an average accuracy of at least 85%, at least 88%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, or higher.

The camera may have a pixel size smaller than 1 micron, 2 microns, 3 microns, 5 microns, 10 microns, 20 microns, and the like. The one or more cameras may have a footprint (e.g., a dimension in a plane parallel to a lens) on the order of 10 mm×10 mm, 8 mm×8 mm, 5 mm×5 mm, 4 mm×4 mm, 2 mm×2 mm, 1 mm×1 mm, 0.8 mm×0.8 mm, or smaller, which allows the device to incorporate one or more cameras and locate the cameras very close to the skin of a user. The footprint of the cameras 102 may comprise a dimension within a range defined by any two of the preceding dimensions, for example within a range from 2 mm to 8 mm, such as 3.5 mm.

The captured images from the cameras may comprise a series of image frames captured at a specific frame rate. In some embodiments, the sequence of images may be captured at standard video frame rates such as about 16p, 24p, 25p, 30p, 43p, 48p, 50p, 60p, 62p, 72p, 90p, 100p, 120p, 300p, 50i or 60i, or within a range defined by any two of the preceding values. In some embodiments, the sequence of images may be captured at a rate less than or equal to about one image every 0.0001 seconds, 0.0002 seconds, 0.0005 seconds, 0.001 seconds, 0.002 seconds, 0.005 seconds, 0.01 seconds, 0.02 seconds, 0.05 seconds, 0.1 seconds, 0.25 seconds, or 0.5 seconds. In some cases, the capture rate may change, under the guidance of a control unit, depending on user input and/or external conditions (e.g., illumination brightness).

In some embodiments, ambient light may be sufficient illumination for the device 100 to capture suitable images. In some embodiments, the device may optionally comprise a light source suitable for producing images having suitable brightness and focus. In some embodiments, the light source may include a light-emitting diode (LED), an optical fiber for illumination, a micro-LED, an IR light source, or otherwise.

The images captured by the cameras may be captured in real time, such that images are produced with reduced latency, that is, with negligible delay between the acquisition of data and the rendering of the image. Real time imaging allows the system to continuously evaluate hand and finger poses and gestures. Real time imaging may include producing images at rates of about or faster than 30 frames per second (fps) to mimic natural vision with continuity of motion.

With additional reference to FIG. 2, the device 100 is illustrated in an embodiment in which four cameras 102 are shown being spaced equally about a mount 104, which may be a wristband. In some cases, the mount 104 may be formed of a resilient material, or may have a portion formed of a resilient material, such as to allow the mount 104 to define a diameter that can be expanded, such as for moving the mount 104 over the hand of a user. The mount 104 may be formed of any suitable material, such as a natural material (e.g., leather, cotton, rubber, etc.), a synthetic material (e.g., plastic, synthetic textile, etc.), or a combination of materials.

In some cases, the mount 104 may have a clasp structure 202 to selectively connect a first end of the mount to a second end of the mount. The clasp structure may be any suitable structure and may include, for example, hook and loop fastener, a buckle, a snap, a tie, a magnet, or some other structure that can be selectively released to allow the mount 104 to be worn or removed by a user.

The one or more cameras 102 may be in communicative connection with a remote computing device. For example, in some cases, the cameras 102 may have a wired connection 204 that allows the cameras to transmit images to a computing device, such as for processing, analysis, storage, or some other purpose. In some cases, the cameras 102 have a wireless connection with a computing device and are configured to send images wirelessly.

Mounting sensors on a user's body removes the need for external sensors, allowing for applications in mobile settings and improving the robustness of hand tracking. A common form factor for mounted sensor-based approaches is the use of gloves with integrated sensors. The gloves may incorporate various sensors to capture signals associated with the local motion of the palm and fingers of a user. The signals may be assembled into hand poses using various techniques. However, these types of gloves are not ideal for many reasons. For instance, sensor-laden gloves tend to be bulky, which can hinder dexterous and natural movements of the hand and can interfere with human-environment interactions.

In contrast, wrist-mounted sensors offer the unique opportunity for sensing hands in a ubiquitous and dexterity-enabling manner. Through experimentation by the current inventors, wrist-mounted devices have shown promise to record and reconstruct hand poses, gestures, and finger positions and recognize daily activities.

According to some embodiments, one or more cameras 102 may be mounted on a wristband 104, which may be a form fitting band. One of the challenges with such a camera location is self-occlusion of the hand. For example, a camera may not have one or more fingers of the user within the field of view of the camera 102. In some cases, only one or more fingertips may be available in a field of view of a camera. As used herein, unless otherwise stated, the term fingertip includes a distal phalanx or phalange or any portion thereof (e.g., a dorsal surface, a ventral surface, a distal end, or the like, or a combination thereof) of a finger, which may be a thumb. For example, a single camera mounted on the back of the wrist may not have any fingers or fingertips within a field of view. Nevertheless, the system 100 may be configured to continuously infer the positions of the fingers based upon visual information that is available. In some instances, one or more fingertips may move into or out of the field of view. Similarly, the contours of the back of the hand that are within the field of view provide surface indications associated with the pose of the hand and finger positions. The changing views allow the system to accurately and continuously infer the hand poses and finger positions. In some cases, the field of view of a single camera captures data associated with the back of a user's hand, such as the area of skin between the wrist and the knuckles. In other words, in some cases, a camera capturing images of the back of a user's hand may not have any fingers within its field of view. The system may still be able to determine hand poses and finger positions with a high degree of accuracy. 
In some embodiments, the system is able to determine hand poses and finger positions with a high degree of accuracy from captured images, wherein those captured images comprise limited visual information such that the human eye is not able to determine hand poses and finger positions from those captured images, or can determine hand poses and finger positions only with a very low accuracy, such as less than 30%, less than 20%, less than 10%, less than 5%, or less than 1% accuracy for a pose or for multiple poses on average. In some embodiments, the limited visual information from an image or a combination of the captured images for a pose or a finger position comprises information for no more than 4 fingers, no more than 3 fingers, no more than 2 fingers, no more than 1 finger, or no fingers observable or identifiable by the human eye.

According to some embodiments, a synthetic hand dataset is created and used to verify hand poses. A deep neural network may be trained to recognize discrete shapes and contours of the portions of the hand within a field of view of the camera, and hand poses can be inferred by regressing the pose parameters based on occluded imaging data captured by the one or more cameras. In embodiments utilizing more than one camera, the rendered images from each of the cameras may be used to represent images of the hand from various angles and fields of view and the system may regress the pose parameters that are used to make up the images of the hand. After training a deep neural network, according to some experiments, the model has been shown to achieve a mean absolute error percentage of less than 12%.

FIG. 3 illustrates an example framework used to infer a hand pose. While the example of FIG. 3 utilizes four cameras mounted to a wrist of a user, it should be appreciated that any suitable number of cameras may be used with a high degree of accuracy, such as one, two, three, four, six, eight, ten or more cameras positioned on a wrist of a user. At a high level, the system employs a convolutional neural network 304 to regress one or more fingertip positions in three dimensions. In some embodiments, the convolutional neural network 304 is referred to as a deep convolutional network 304 or a backbone network 304.

More specifically, in some examples, the framework of FIG. 3 includes a deep model for fingertip prediction, and a kinematic model for full hand joint estimation. According to some embodiments, the framework will receive the image input 302 of the one or more cameras. In embodiments that include multiple cameras, the framework will receive images from the cameras that are captured at the same time. The images may include time stamp data to allow the system to accurately associate the images captured at the same time. The images may be sent to a deep convolutional network 304 that has been trained to predict the fingertip positions in three dimensions. The full set of hand joint angles may also be inferred by feeding the predicted fingertip positions into a kinematic model.

For each time step, the model may be configured to output an estimation of full hand joint positioning, which may enable continuous hand tracking. In some embodiments, images that are captured at the same time by multiple cameras may be stitched to create a three-dimensional hand model which can be used to predict a 15-dimension output (e.g., 3D coordinates of the five fingertips). In some cases, a 20-dimension output is created.

In some cases, the model includes a backbone network 304 and a regression network 308. According to some implementations, each of the image frames 302 is sent to the backbone network 304 where the image features are extracted independently. The features may further be concatenated at a batch normalizer 306 which may also rescale and/or involve channel reduction.

In some cases, each block within the convolutional neural network 304 includes several convolution operations. In some cases, each convolution operation is followed by a batch normalization 306 and rectified linear unit. A global average pooling may be performed at the end of the backbone network 304 to extract a vector representation of each image. The backbone network 304 may be pre-trained to recognize patterns within imagery for visual recognition.

The regression network 308 may include one, two, or more fully connected layers. In some embodiments, the regression network 308 maps the concatenated features into a 15-dimension output.
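As a minimal sketch of the data flow just described, the following Python illustrates global average pooling of each camera's feature maps, concatenation of the pooled vectors, and a single fully connected layer producing a 15-dimension output that is read as five 3D fingertip coordinates. The function names, the nested-list tensor layout, and the use of one linear layer are assumptions made for illustration; batch normalization, rescaling, and channel reduction are omitted, and no trained weights are implied.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C vector."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def linear(x, weights, bias):
    """Fully connected layer: y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def regress_fingertips(per_camera_features, weights, bias):
    """Pool each camera's features independently, concatenate, and regress.

    Returns a list of (x, y, z) tuples, one per fingertip.
    """
    concatenated = []
    for feature_map in per_camera_features:
        concatenated.extend(global_average_pool(feature_map))
    output = linear(concatenated, weights, bias)
    if len(output) % 3 != 0:
        raise ValueError("output length must be a multiple of 3")
    # Group the flat output into 3D coordinates (15 values -> 5 fingertips).
    return [tuple(output[i:i + 3]) for i in range(0, len(output), 3)]
```

With four cameras and eight feature channels per camera (both hypothetical values), the concatenated vector has 32 entries, so the weight matrix in this sketch would have 15 rows of 32 values.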

In some cases, the output of the regression network 308 may be compared with a known dictionary 310 to verify the three-dimensional hand pose 312.

In some cases, the framework of FIG. 3 is trained by using labels and ground truth hand pose information. In some cases, a separate model may be trained for each unique user to account for unique hand and finger sizes, poses, and gestures. In some cases, training involves mini-batch stochastic gradient descent and may further include momentum and descent. One of the difficulties in training models results from changing conditions during visual imaging. One way of training the model to deal with visual changes is by applying random color perturbation such as to mimic camera color distortion and light conditions changing during image capture. This type of color-specific training allows the trained model to be more robust and less susceptible to induced errors from environmental factors.
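One possible form of the random color perturbation mentioned above is sketched below in Python. It applies a random gain to each color channel and a random brightness shift to the whole image, then clamps the result to the valid range. The function name, the image layout (rows of RGB tuples), and the default perturbation magnitudes are assumptions for illustration only and are not taken from the described experiments.

```python
import random


def perturb_colors(image, rng, max_gain=0.2, max_shift=20.0):
    """Return a color-perturbed copy of an image.

    image: list of rows, each row a list of (r, g, b) tuples in 0..255.
    rng:   a random.Random instance, so the perturbation is reproducible.
    """
    # One random gain per channel mimics camera color distortion.
    gains = [1.0 + rng.uniform(-max_gain, max_gain) for _ in range(3)]
    # One random offset for the whole image mimics a lighting change.
    shift = rng.uniform(-max_shift, max_shift)

    def clamp(value):
        return max(0.0, min(255.0, value))

    return [
        [tuple(clamp(c * g + shift) for c, g in zip(pixel, gains))
         for pixel in row]
        for row in image
    ]
```

In a training loop, a fresh perturbation would typically be drawn for each image in each mini-batch, so that the model rarely sees exactly the same colors twice.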

In some example embodiments, once the fingertip positions are determined, the rest of the joint angles may be inferred through inverse kinematics. For example, forward kinematics may be used to map from joint angles to fingertip positions to create a kinematic dictionary 310. In some cases, the inference of the joint angles can be recast as a reverse lookup query in the dictionary 310. Once the dictionary 310 is populated, the retrieval of the hand position is efficient, and the results satisfy biomechanical constraints of the hand.

FIGS. 4A, 4B, 4C, and 4D illustrate sample image and results on a synthetic data set. For example, FIGS. 4A and 4B illustrate a front view and an isometric left view, respectively, showing the fields of view from an embodiment device utilizing four cameras spaced equally about the wrist of a user. In some cases, the device includes a top camera 402 that is located on the back of the wrist of a user and is aimed at the back of the hand of the user. In some embodiments, additional cameras may be used, such as a right camera 404, a left camera 406, and/or a bottom camera 408. In some cases, only a single camera is used, which may be positioned as a top camera 402, a right camera 404, a left camera 406, or a bottom camera 408, or some other orientation about the wrist of a user. The fields of view of the cameras are shown as squares with each camera capturing a unique view. In some embodiments, the fields of view of two or more cameras may overlap, while in other cases, the fields of view may not overlap.

FIG. 4C is a representation of images captured by each of the four cameras. As can be seen, the images do not capture much information related to fingertips or finger positions. In fact, as illustrated, three of the images do not capture the fingers or fingertips at all. Nevertheless, fingertip prediction and hand pose inference can be very accurate based upon the information that is captured. Furthermore, a first capture from a first camera can be compared with a second capture from the first camera and based upon the biomechanical limitations of the hand joints, comparing a first capture with a second capture can provide valuable clues that increase the accuracy of a prediction. In some embodiments, the system captures one or more images by one or more imaging sensors, wherein the one or more images, or the combination of one or more images for a hand pose capture limited information related to hand joints, fingertips, and/or finger positions. In some embodiments, the limited information is for at most 10%, at most 15%, at most 20%, at most 25%, at most 30%, at most 40%, at most 50%, at most 60%, at most 70% or at most 80% of the hand joints and/or finger tips.

By utilizing the model, such as that described above with respect to embodiments herein, the system can infer hand poses. FIG. 4D illustrates a comparison between the ground truth 410 which is shown in cross hatching, and a predicted hand pose 412 shown without cross hatching. By iteratively training the model on ground truth and populating a dictionary, the inferred hand pose can be quite accurate, and testing has shown an accuracy of over 98% in many cases. In some embodiments, a 3D spatial position of the hand pose is constructed by predicting positions of at least a portion of the hand joints or at least a portion of the positions of finger tips including the hand joints and/or finger tips not captured by the images from the imaging sensor(s). In some embodiments, one or more images are captured by one or more imaging sensors and processed by a processor or microprocessor to predict a 3D spatial position for a hand pose of the user, and wherein the one or more images or their combinations capture limited visual information for less than 30%, less than 40%, less than 50%, less than 60%, or less than 70% of the hand joints and/or finger tips of a hand for the hand pose, and the predicted 3D spatial position comprises full information for the hand pose, or at least 80%, 90%, 95% or 100% information or accurate information for the hand joints and/or finger tips of the hand for the hand pose. The accuracy of the prediction for a hand pose is at least 88%, at least 90%, at least 92%, at least 95%, at least 96%, or at least 98%. In some embodiments, the prediction is based on a model comprising parameters of joints (e.g. length between joints, joint angles, 3D coordinates of joints, and/or any combination thereof), parameters of finger tips (e.g. 3D coordinates of finger tips), parameters of specific hand poses (e.g. generated by one or more learning models) and/or any combination thereof.

FIGS. 5A and 5B illustrate a kinematic model that may be used to infer the hand joint positions. A kinematic model that relies on images taken from the wrist is different than a model that relies on remotely mounted cameras. For example, a wrist-based system has fewer degrees of freedom, because the wrist is the origin and therefore removes the degrees of freedom ordinarily provided by the wrist joint. Accordingly, a simplified kinematic model may be employed for a wrist-mounted system that locates an origin at the wrist. In some cases, each of the four fingers (excluding the thumb) is parameterized by four joints and one fingertip. Three of the joints are located at the finger joints and one of them is at the wrist.

The joints and the degrees of freedom associated with them are illustrated in FIG. 5B. For example, the first digit 502 includes three joints, the metacarpophalangeal joint (MCP) 504, the proximal interphalangeal joint (PIP) 506, and the distal interphalangeal joint (DIP) 508. The MCP 504 has two degrees of freedom, being able to pivot about the joint toward the palm (i.e., flexion and extension), and moving the fingers towards or away from one another (i.e., adduction and abduction). The PIP 506 and DIP 508 each have a single degree of freedom, only able to pivot about the joint. The fingertips 510 are not capable of moving independently of the finger joints, and therefore have zero degrees of freedom.

The three-dimensional coordinates of the joints are denoted as X_(i)=(x_(i), y_(i), z_(i)) with i∈[0,4]. The joint nodes from the wrist to the fingertip are indexed from X₀ to X₄. The positions X_(i) are fully parameterized in the model by joint angles [θ₀, θ₁, θ₂, θ₃, θ₄], where θ₀ is the deflection angle when a finger is moving and θ_(i) (i>0) are the bending angles of each joint (i.e., Euler angles). Once the device is mounted, θ₀ becomes fixed. The relation between {θ_(i)} and {X_(i)} (i>0) is given by the forward kinematics as

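$x_{i} = x_{i-1} - \left( l_{i-1} \cdot \cos(\theta_{i}) \right) \cdot \sin(\theta_{i-1})$

$y_{i} = y_{i-1} + \left( l_{i-1} \cdot \cos(\theta_{i}) \right) \cdot \cos(\theta_{i-1})$

$z_{i} = z_{i-1} + \left( l_{i-1} \cdot \sin(\theta_{i}) \right)$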

where l_(i−1) is the length of the finger segment between joints i−1 and i. The segment lengths {l_(i)} may be assumed to be fixed, as they do not have a large effect on hand poses. For the thumb, only three joints (i≤3) are available and the equations remain the same.
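The forward kinematics relation above can be sketched in Python as follows. The function evaluates the three recurrence equations exactly as written, starting from an origin at the wrist. The function name, the use of radians, and the segment lengths used in any call are assumptions for illustration.

```python
import math


def forward_kinematics(thetas, lengths, origin=(0.0, 0.0, 0.0)):
    """Compute joint positions X_0..X_n from joint angles and segment lengths.

    thetas:  [theta_0, ..., theta_n] in radians.
    lengths: [l_0, ..., l_(n-1)], where l_(i-1) spans joints i-1 and i.
    Returns a list of (x, y, z) tuples; the last entry is the fingertip.
    """
    if len(lengths) != len(thetas) - 1:
        raise ValueError("need exactly one segment length per joint pair")
    positions = [origin]
    for i in range(1, len(thetas)):
        x_prev, y_prev, z_prev = positions[-1]
        # Projection of segment i-1 onto the x-y plane.
        reach = lengths[i - 1] * math.cos(thetas[i])
        x = x_prev - reach * math.sin(thetas[i - 1])
        y = y_prev + reach * math.cos(thetas[i - 1])
        z = z_prev + lengths[i - 1] * math.sin(thetas[i])
        positions.append((x, y, z))
    return positions
```

For example, with all angles set to zero every segment lies along the y axis, so the fingertip ends up at a y coordinate equal to the sum of the segment lengths.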

Because of limited and known phalanx biomechanical kinematics, the joint positions in the hand model may be constrained once the fingertip positions are known. The kinematic dictionary may capture these constraints, which facilitates the inference of the joint positions. In some cases, the dictionary may be populated by enumerating the parameter space of [θ₀, θ₁, θ₂, θ₃, θ₄] for each finger with a small step size, and recording the corresponding fingertip positions X₄. The ranges of the parameters are listed in Table 1 below.

TABLE 1
Range of the Joint Angles for Creating the Kinematic Dictionary

Finger | θ₀ | θ₁ | θ₂ | θ₃ | θ₄
Thumb | [−90°, 0) | 0 | [−90°, 90°) | [θ₂, θ₂ + 90°) | [θ₃, θ₃ + 90°)
Index | [−30°, 0) | [−90°, 90°) | [θ₁, θ₁ + 90°) | [θ₂, θ₂ + 60°) | [θ₃, θ₃ + 45°)
Middle | 0 | [−90°, 90°) | [θ₁, θ₁ + 90°) | [θ₂, θ₂ + 60°) | [θ₃, θ₃ + 45°)
Ring | [0, 30°) | [−90°, 90°) | [θ₁, θ₁ + 90°) | [θ₂, θ₂ + 60°) | [θ₃, θ₃ + 45°)
Little | [0, 45°) | [−90°, 90°) | [θ₁, θ₁ + 90°) | [θ₂, θ₂ + 60°) | [θ₃, θ₃ + 45°)

In some embodiments, once the fingertip positions are determined in combination with a prepopulated kinematic dictionary, the joint parameters can then be retrieved and determined from the dictionary. For instance, θ*₀ can be determined by using the equation:

$\theta_{0}^{*} = {\arctan\left( \frac{x_{4} - x_{0}}{y_{4} - y_{0}} \right)}$

This is possible because, according to the kinematic model, there may only be a single solution of θ₀ for a given fingertip position. Given θ*₀, the dictionary entries [θ₀, θ₁, θ₂, θ₃, θ₄] are further searched to identify the set of parameters [θ*₀, θ*₁, θ*₂, θ*₃, θ*₄] whose fingertip position is closest to the given fingertip position. This can be done very efficiently using, for example, locality-sensitive hashing. The result [θ*₀, θ*₁, θ*₂, θ*₃, θ*₄] satisfies the kinematic constraints captured by the dictionary and may be used as the output estimation.
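A minimal Python sketch of populating and querying such a dictionary for the index finger is given below. It enumerates the index finger angle ranges on a coarse grid, records the fingertip position for each angle set using the forward kinematics relation, and retrieves the entry closest to a query fingertip position. Several simplifications are assumed: the 15° step is far coarser than the small step size contemplated above, the segment lengths are hypothetical, and an exhaustive nearest-neighbor scan stands in for locality-sensitive hashing.

```python
import math

# Hypothetical segment lengths l_0..l_3 (arbitrary units), for illustration.
SEGMENT_LENGTHS = (7.0, 4.5, 2.5, 2.0)


def fingertip(thetas_deg, lengths=SEGMENT_LENGTHS):
    """Fingertip position X_4 for joint angles given in degrees."""
    t = [math.radians(a) for a in thetas_deg]
    x = y = z = 0.0
    for i in range(1, len(t)):
        reach = lengths[i - 1] * math.cos(t[i])
        x -= reach * math.sin(t[i - 1])
        y += reach * math.cos(t[i - 1])
        z += lengths[i - 1] * math.sin(t[i])
    return (x, y, z)


def build_index_finger_dictionary(step=15):
    """Enumerate the index finger angle ranges with the given step in degrees."""
    entries = []
    for t0 in range(-30, 0, step):
        for t1 in range(-90, 90, step):
            for t2 in range(t1, t1 + 90, step):
                for t3 in range(t2, t2 + 60, step):
                    for t4 in range(t3, t3 + 45, step):
                        angles = (t0, t1, t2, t3, t4)
                        entries.append((angles, fingertip(angles)))
    return entries


def lookup(entries, target):
    """Return the angle set whose recorded fingertip is closest to target."""
    best_angles, _ = min(entries, key=lambda e: math.dist(e[1], target))
    return best_angles
```

Because every stored angle set was drawn from the permitted ranges, any angle set returned by the lookup falls within those ranges. A production system would likely replace the linear scan with an approximate nearest-neighbor index once the dictionary grows large.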

Similarly, fingertip positions can be determined for a second digit 512, a third digit 514, and a fourth digit 516. From the determined fingertip positions in three-dimensional space, the joint parameters for each of the fingers can be determined and combined into a 3D model to estimate hand pose.

To further enhance the output estimation, the model may determine fine-grained micro-finger poses. In some cases, the system may learn another deep model to classify these fine-grained micro-finger poses. Optionally, the trained weights from the pose estimation network may be leveraged to efficiently generate pose recognition. This may be done, for example, by utilizing a similar network architecture as described with respect to embodiments herein, although a last layer may be replaced by a fully connected layer, which may be supervised by cross entropy loss for classification. In some cases, the penultimate layer may be dropped, and optionally, the first three blocks of the backbone may be frozen such as to prevent over-fitting.

FIG. 6 illustrates continuous hand pose estimation results 600 using embodiments as described herein. In some examples, the inventors have discovered that continuously reconstructing hand postures is possible without the need of seeing all the fingers. Moreover, the hand poses can be determined with only outline imaging data. For example, in some cases, one or more thermal imaging cameras were used to capture outline images (e.g., silhouettes) from a wrist-mounted system, and the system was able to accurately estimate the hand position. In some cases, the hand position was estimated from low resolution imagery, such as on the order of 32×24 pixels.

The hand pose estimation results 600 are shown in columns that represent each time step of imaging. The first row represents ground truth 602, such as the actual hand pose as observed by a human. The second row represents a predicted hand pose 604 based upon embodiments such as those described herein with reference to FIGS. 1-5. The third row 606 represents a view from a top wrist-mounted camera. The fourth row 608 represents a view from a right-side wrist-mounted camera. The penultimate row 610 represents a view from a bottom-side wrist-mounted camera. The last row 612 represents a view from a left-side wrist-mounted camera. In some examples, the views of the top, right, bottom, and left view are captured by thermal imaging cameras, while in other embodiments, the views represent images captured by CCDs or CMOS sensors. In some cases, the images are low resolution, such as less than about 64×64 pixels, and in some cases, represent silhouette images.

In some embodiments, a single camera is used to capture a single field of view. Through training a model, images capturing a single field of view can be used to infer and predict hand poses and gestures with a relatively high degree of accuracy. In some cases, a single camera mounted to the back of the wrist can be used to predict hand pose and gesture data with a degree of accuracy above 80%, or above 85%, or above 90%, or in some cases above 95%.

In some embodiments, a system is configured to capture images of a hand from a wrist-mounted system comprising one or more cameras. In some cases, the wrist-mounted system captures images of less than all of the fingers. In some cases, the wrist-mounted system captures image data associated with one finger, or two fingers, or three fingers, or four fingers. From the images containing less than all of the fingers, the system is able to determine a hand pose, including all of the fingers. In some embodiments, a large percentage of the captured images contain surface curvature of the hand of a user, and in some cases, the images capture predominantly the back of the hand of a user. For example, in some cases, greater than 90% of the captured hand image data is of the back of the hand, and less than 10% of the captured hand image data includes one or more fingers of a user. In other cases, greater than 80% of the captured hand image data is of the back of the hand, and less than 20% of the captured hand image data includes one or more fingers of a user. As an example, where an image captures data associated with a hand, the image data representing the hand (e.g., excluding background image data) is the captured hand image data. In some cases, greater than 70% of the captured hand image data includes the back of the hand, while less than 30% of the captured hand image data includes image data associated with one or more fingers of a user. The result is that a system that largely captures image data associated with the back of the hand of a user provides valuable information for inferring hand pose and gesture, even where the captured hand image data largely lacks data associated with the fingers of a user.

FIGS. 7A and 7B illustrate sample images captured from a wrist-mounted imaging sensor and a predicted hand pose based on the images, respectively. In block A, a first image is captured by a first camera showing a first field of view. In block B, a second image is captured by a second camera showing a second field of view. In block C, a third image is captured by a third camera showing a third field of view. In block D, a fourth image is captured by a fourth camera showing a fourth field of view. As illustrated, only block A and block C include any meaningful captured hand image data that includes visible fingers. Notably, even isolating block B that does not include any captured hand image data that includes fingers, the system can use this captured hand image data to infer hand poses and finger positions.

FIG. 7B illustrates a predicted hand pose based upon the image data of FIG. 7A. Notably, the discrete data points represent portions of fingers based upon a kinematic model and a determined position of the five fingertips in three-dimensional space.

One of the benefits of the embodiments described herein is the capability to reconstruct the entire range of hand poses (e.g., 20 finger joint positions) by deep learning the outline shape of the hand by one or more cameras located close to the wrist of a user. As used herein, close to the wrist refers to a distance between the skin of a user and the optical axis of the camera. In some cases, the distance is within the range of from 2 mm to about 10 mm, or from 3 mm to about 5 mm. In some cases, the distance is about 2 mm, 3 mm, 4 mm, 5 mm, 6 mm, 7 mm, or 8 mm.

In some embodiments, one or more cameras capture less than all of the fingers at a framerate of about 16 Hz, or about 20 Hz, or about 22 Hz, or about 28 Hz, or about 30 Hz.

FIG. 8 illustrates an example embodiment of a low-profile wrist-mounted system 100 for continuously estimating a hand pose of a user. The system 100 may use one or more cameras 102 secured to a mount 104. The system 100 may be considered low-profile, meaning that the dimensions of the system locate the cameras 102 in close proximity to the wrist of a user. The cameras may have an optical axis 802 that defines the center of the lens along the center of a focal direction of the camera. The optical axis 802 may be spaced a distance d₁ away from the wrist of a user. The distance d₁ may be on the order of less than 3 mm, or less than 4 mm, or less than 5 mm, or less than 6 mm, or less than 7 mm.

Similarly, the outer periphery of the system 100 (e.g., the location on the system that is furthest away from the wrist of the user) may be spaced a distance d₂ from the wrist of the user. The distance d₂ may be less than about 5 mm, or less than 8 mm, or less than 10 mm, or less than 12 mm, or less than 18 mm, or less than 20 mm.

In some cases, the optical axis 802 may be spaced away from the mount (e.g., wearable band) by a distance of less than about 1 mm, or less than about 2 mm, or less than 4 mm, or less than 5 mm, or less than 7 mm. A low-profile mount thus allows the optical axis to be spaced only a small distance away from the skin of the user, which allows the system to be nonobtrusive, a significant advantage over prior systems. However, such a low-profile system creates additional considerations resulting from the available field of view from this perspective that is very close to the skin of a user. The resulting field of view will be largely occluded by the hand of the user and it may be more difficult to acquire hand image data that includes finger image data.

The system 100 may be combined with other sensors to provide additional details regarding hand poses or gestures. As an example, one or more inertial measurement units (IMU) may be combined with the systems described herein to provide motion data of a hand or arm in combination with a hand pose or gesture. Similarly, a first system 100 may be worn on a first wrist of a user and a second system 100 may be worn on a second wrist of the user. The first and second systems may each independently infer hand poses and gestures of the respective hands that the systems are imaging. However, the poses and/or gestures of each of the two hands may be combined together to recognize two-handed discrete poses and gestures.

While the system 100 illustrated in FIG. 8 shows 4 cameras, it should be appreciated that a single camera 102 can be used to provide meaningful data to infer hand pose and gestures. Through experimentation, the mean absolute error (MAE) can be reduced by adding additional cameras beyond a single camera; however, very acceptable results can be achieved using a single camera. A single camera may be mounted on the inside of the wrist (e.g., imaging the palm side of the hand), or on the back of the wrist imaging the back of the hand. Additional cameras can be added to reduce the MAE; however, there is a point of diminishing returns that was exhibited beyond about 4 cameras. That said, increased accuracy can be found by utilizing 5, 6, 7, or 8 cameras in a wrist-mounted camera system.

Through experimentation and learning the model, it has been shown that a low profile camera that captures images in which the fingers and fingertips are largely occluded by the hand of the user continues to provide valuable data for accurate hand pose and gesture recognition.

According to examples, a single camera capturing deformation of the skin located on the back of the hand provides sufficient data for acceptable accuracy. In some experiments, a single camera capturing skin deformation (e.g., captured hand image data where fingers were largely occluded), provided sufficient information to provide hand pose tracking with an accuracy of greater than 75%, or greater than 78%, or greater than 80%.

In some examples, one or more thermal imaging cameras 102 were secured to a mount 104. In some examples, the thermal imaging cameras 102 had a temperature sensitivity of about ±1° C. The thermal imaging cameras 102 had a framerate of 16 Hz, a resolution of 32×24 pixels, and a field of view of 110°. Each camera was in communication with a remote computing device for receiving and processing the image data. A time matching algorithm may be used to synchronize the image capture in a device using more than one camera to encourage all the image frames for a given time step to be captured at the same time.
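One simple way a time matching step could work is sketched below in Python. For each frame of a reference camera, the sketch selects the frame with the nearest timestamp from every other camera and keeps the group only if all selected frames fall within a tolerance. The function name, the data layout, and the tolerance value are assumptions for illustration; the specific time matching algorithm used in the experiments is not detailed here.

```python
def match_frames(streams, tolerance_s):
    """Group frames captured at (approximately) the same time.

    streams:     dict mapping camera name -> list of (timestamp_s, frame).
    tolerance_s: maximum allowed timestamp difference from the reference.
    Returns a list of (reference_timestamp, {camera name: frame}) tuples.
    """
    names = list(streams)
    reference = names[0]  # the first camera serves as the time reference
    matched = []
    for t_ref, frame_ref in streams[reference]:
        group = {reference: frame_ref}
        for name in names[1:]:
            if not streams[name]:
                break  # this camera has no frames at all
            t, frame = min(streams[name], key=lambda item: abs(item[0] - t_ref))
            if abs(t - t_ref) > tolerance_s:
                break  # nearest frame is too far away in time
            group[name] = frame
        else:
            # Reached only when every camera contributed a frame.
            matched.append((t_ref, group))
    return matched
```

At a framerate of 16 Hz, consecutive frames are 62.5 ms apart, so a tolerance well below that interval (for example 10 ms, a hypothetical value) avoids pairing a frame with a neighbor from the previous or next time step.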

In some cases, the device may be calibrated for each unique user. For example, a user may be asked to perform predefined gestures to calibrate the device each time the user wears the device. In some cases, the system may instruct a user on how to adjust the device for a subsequent use to ensure that the field of view will be comparable to a calibration field of view to increase accuracy of predictions and inferences.

According to some embodiments, the device 100 may include a power supply, which may be any suitable power supply, such as, for example, a battery, a solar cell, a kinetic energy harvester, a combination of power supplies or other power supply which may provide power for the cameras 102, and in some cases, one or more processors.

In some embodiments, the device may be in communication with a remote computing device that is configured to receive, analyze, and/or process the image data such as to apply the deep neural network to estimate hand pose and gestures. The communication may be wireless, wired, or a combination. As with any embodiment, the one or more cameras may be depth sensing, infrared, RGB, thermal spectrum, hyper spectrum, or some other camera type or combination of camera types.

FIG. 9 illustrates another form factor of an embodiment of a low-profile wrist-mounted system 100 for continuously estimating a hand pose of a user. The system 100 may resemble a wristwatch and include a camera 102 and a mount 104. The camera 102 may be incorporated into the watch and be located to capture images of the user's hand when the watch is worn. In some embodiments, the system includes two cameras positioned on opposite sides of the watch such that at least one camera will be facing the user's hand whether the system 100 is worn on the left hand or the right hand. The system may be as described herein with respect to any embodiment. In some cases, additional cameras may be secured to the mount and positioned to capture images of the user's hand when the device is worn on the wrist.

The system 100 may include a power supply, such as a battery, and may further include one or more processors. In some cases, the system 100 includes a communications system for sending/receiving data to/from a remote computing device.

FIG. 10 illustrates a method 1000 for determining hand poses and gestures. At block 1002, the system receives one or more first images from one or more imaging sensors mounted to a wrist. The images may include only limited finger position data and may instead predominantly include surface contours of the back of the wrist or hand.

At block 1004, the system determines, based at least in part on the one or more first images, a 3D position of one or more fingertips. This may be performed, for example, by the deep neural network, as described herein.

At block 1006, the system determines, based at least on the 3D position of the one or more fingertips, a pose of the hand.

At block 1008, the system receives one or more second images from the one or more imaging sensors.

At block 1010, the system determines, based at least in part on the one or more second images, a second pose of the hand. The second pose of the hand may be compared to the pose determined at block 1006 (the first pose), and the system may recognize a gesture associated with the change from the first pose to the second pose.
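The comparison of the first and second poses can be illustrated with a very small Python sketch, assuming each pose has already been reduced to a discrete label. The pose labels, the gesture names, and the lookup table are hypothetical and serve only to show the idea of mapping a pose transition to a gesture.

```python
# Hypothetical mapping from (first pose, second pose) to a gesture name.
GESTURES = {
    ("open", "fist"): "grab",
    ("fist", "open"): "release",
}


def recognize_gesture(first_pose, second_pose, table=GESTURES):
    """Return the gesture for a pose transition, or None if none is known."""
    if first_pose == second_pose:
        return None  # no change in pose, so no gesture
    return table.get((first_pose, second_pose))
```

A practical system would more likely compare continuous joint angles over several time steps, but the table form makes the transition-based logic of blocks 1006 and 1010 explicit.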

In the described implementations, the system 100 may include the processor(s) and memory. In various embodiments, the processor(s) may execute one or more modules and/or processes to cause the imaging sensors (e.g., cameras) to perform a variety of functions, as set forth above and explained in further detail in the disclosure. In some embodiments, the processor(s) may include a central processing unit (CPU), a graphics processing unit (GPU), both CPU and GPU, or other processing units or components known in the art. Additionally, each of the processor(s) may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems. The processor may include multiple processors and/or a single processor having multiple cores. The computing device may comprise distributed computing resources and may have one or more functions shared by various computing devices. In some instances, the imaging sensors are in communication with one or more computing devices through wired or wireless communication. The imaging sensors may be powered by a battery pack, which may be carried by the user, such as by the wearable band, or may be wired to receive power from another location or device.

The computing device may have memory which may include computer-readable storage media (“CRSM”), which may be any available physical media accessible by the processor(s) to execute instructions stored on the memory. In some implementations, CRSM may include random access memory (“RAM”) and Flash memory. In other implementations, CRSM may include, but is not limited to, read-only memory (“ROM”), electrically erasable programmable read-only memory (“EEPROM”), or any other medium which can be used to store the desired information and which can be accessed by the processor(s). The memory may include an operating system, one or more modules, such as a video capture module, a video stitching module, a deep neural network module, a kinematic module, a sensor data module, and a sensor data analysis module, among others.

The imaging sensors may include one or more wireless interfaces coupled to one or more antennas to facilitate a wireless connection to a network. The wireless interface may implement one or more of various wireless technologies, such as Wi-Fi, Bluetooth, radio frequency (RF), and so on. In some instances, the imaging sensors are coupled to a wired connection to one or more computing devices to facilitate a wired connection for transmitting imaging data to the computing devices.

The imaging sensors may be any suitable image sensor, such as visible spectrum camera, thermal spectrum camera, infrared spectrum camera, or a combination. In some embodiments, the hand and/or finger pose may be used to control a virtual reality system by determining gestures, hand poses, hand movements, and the like. In some embodiments, the system is able to identify at least 10, or at least 12, or at least 15, or at least 20 different hand poses. In some cases, the system determines hand poses with an accuracy of at least 85%, at least 88%, at least 90%, at least 92%, at least 93%, at least 95%, or higher. As used herein, the term “hand pose” is a broad term and is used to indicate any position, orientation, flexure, or configuration of a hand and fingers, including roll, pitch, and yaw axes of the hand as well as finger abduction, adduction, flexion, extension, and opposition.

A number of non-limiting examples of illustrative embodiments will now be described.

Example 1 is a system including a wearable band comprising one or more pieces extending longitudinally along a first axis and operable to be coupled to an arm of a user, one or more imaging sensors, the one or more imaging sensors comprising at least a first imaging sensor disposed at a first position on the wearable band, the first imaging sensor disposed to have a first field of view that is lateral to the first axis and is anatomically distal when the wearable band is coupled to an arm of the user. The first imaging sensor defines an optical axis spaced a distance from the wearable band, the distance being less than about 10 mm.

Example 2 includes the subject matter of Example 1, wherein the one or more imaging sensors further comprise a second imaging sensor disposed on the wearable band at a second position, the second imaging sensor disposed to have a second field of view that is lateral to the first axis and is anatomically distal when the wearable band is coupled to an arm of the user, wherein the second field of view is different from the first field of view.

Example 3 includes the subject matter of Example 1 or Example 2, the one or more imaging sensors further comprising a third imaging sensor disposed on the wearable band at a third position, the third imaging sensor disposed to have a third field of view that is lateral to the first axis and is anatomically distal when the wearable band is coupled to an arm of the user, wherein the third field of view is different from the first field of view and the second field of view.

Example 4 includes the subject matter of any of Examples 1-3, the one or more imaging sensors further comprising a fourth imaging sensor disposed on the wearable band at a fourth position, the fourth imaging sensor disposed to have a fourth field of view that is lateral to the first axis and is anatomically distal when the wearable band is coupled to an arm of the user.

Example 5 includes the subject matter of any of Examples 1-4, wherein the first field of view comprises a dorsal aspect of a hand of a user, a dorsal aspect of a hand of a user and at least a portion of a range of motion of a proximal phalanx of at least one finger (which may be a thumb) of a hand of a user, or a dorsal aspect of a hand of a user and at least a portion of a range of motion of a first phalanx (e.g., a proximal phalanx) and a second phalanx (e.g., a middle phalanx or a distal phalanx) of at least one finger (which may be a thumb) of a hand of a user, when the wearable band is coupled to an arm of the user.

Example 6 includes the subject matter of any of Examples 1-5, wherein the first field of view of the first imaging sensor comprises a first dorsal aspect of a hand of a user and the second field of view of the second imaging sensor comprises a second dorsal aspect of a hand of a user and, optionally, wherein the first field of view, the second field of view, or both the first field of view and the second field of view comprise both a dorsal aspect of a hand of a user and at least a portion of a range of motion of a proximal phalanx of at least one finger (which may be a thumb) of a hand of a user or at least a portion of a range of motion of both a first phalanx (e.g., a proximal phalanx) and a second phalanx (e.g., a middle phalanx or a distal phalanx) of at least one finger (which may be a thumb) of a hand of a user, when the wearable band is coupled to an arm of the user.

Example 7 includes the subject matter of any of Examples 1-6, wherein the first field of view of the first imaging sensor comprises a first dorsal aspect of a hand of a user and the second field of view of the second imaging sensor comprises a palmar aspect of a hand of a user when the wearable band is coupled to an arm of the user.

Example 8 includes the subject matter of any of Examples 1-7, further comprising a computing device operable to communicate with the first imaging sensor via a hardwired or a wireless communication pathway and operable to receive image data from the first imaging sensor.

Example 9 includes the subject matter of any of Examples 1-8, and further comprises a computing device operable to communicate with the first imaging sensor and the second imaging sensor via a hardwired or a wireless communication pathway and operable to receive image data from the first imaging sensor and the second imaging sensor and a stitching module comprising a stitching model executable by the computing device to stitch received image data from the first imaging sensor and the second imaging sensor for a correlated time to create a stitched image at the correlated time and, optionally, to stitch received image data from the first imaging sensor and the second imaging sensor for a plurality of correlated times to create a plurality of stitched images corresponding, respectively, to the plurality of correlated times.

Example 10 includes the subject matter of any of Examples 1-9, and further comprises a 3D prediction module comprising a 3D prediction model executable by the computing device and configured to analyze the plurality of stitched images and to approximate or to determine a position of one or more fingertips of a user at the plurality of correlated times.

Example 11 includes the subject matter of any of Examples 1-10, and further comprises a kinematic module comprising a kinematic model executable by the computing device to determine or approximate, based at least in part on the approximated or determined position of one or more fingertips, a position and orientation of hand joint angles including one or more of a metacarpophalangeal joint angle, a proximal interphalangeal joint angle, a distal interphalangeal joint angle, or a radiocarpal joint angle.

Example 12 includes the subject matter of any of Examples 1-11, wherein the distance is less than about 5 mm.

Example 13 includes the subject matter of any of Examples 1-12, and further comprises a processing device operable to process image data from the one or more imaging sensors, the image data comprising spatial information associated with less than 50% of hand joints and/or finger tips of a hand pose and to predict from the image data, spatial information for a remainder of the hand joints and/or finger tips of a hand pose of a user.

Example 14 is a method comprising the act of receiving image data from a wrist borne wearable device bearing a plurality of imaging sensors disposed to image at least a portion of a hand of a user excluding a distal phalanx of one or more fingers of the user, the image data comprising image data from each of the plurality of imaging sensors at a first time, an optical axis of each of the plurality of imaging sensors being spaced from an outer surface of the wearable device by less than about 10 mm. Example 14 also includes the act of determining, from the image data, a position of one or more distal phalanxes or fingertips of the user at the first time and the act of determining, from the image data and the determined position of the one or more distal phalanxes or fingertips of the user, a pose of the hand of the user at the first time.

Example 15 includes the subject matter of Example 14, wherein the determining of the position of the one or more distal phalanxes or fingertips of the user is performed by a convolutional neural network.

Example 16 includes the subject matter of Example 14 or Example 15, wherein the determining of the pose of the hand of the user is performed by inference with a skeletal and kinematic model.

Example 17 includes the subject matter of any of Examples 14-16, wherein the pose of the hand of the user includes a metacarpophalangeal joint angle, a proximal interphalangeal joint angle, a distal interphalangeal joint angle, a radiocarpal joint angle, or any combination thereof.

Example 18 includes the subject matter of any of Examples 14-17, and further comprises the acts of receiving image data from the plurality of imaging sensors at a second time and determining a pose of the hand of the user at the second time.

Example 19 is a method for tracking the position of a hand comprising the act of receiving one or more first images from one or more imaging sensors of a wrist borne wearable device, the one or more first images including image data of one or more portions of a user's hand, but excluding image data of one or more distal phalanxes or of one or more fingertips. Example 19 also includes the act of determining, based on the one or more first images, a 3D spatial position of one or more portions of the user's hand that are not represented in the one or more first images. Example 19 also includes the act of determining a pose of the hand based on the determined 3D spatial position.

Example 20 includes the subject matter of Example 19, wherein the determining of the 3D spatial position of the one or more portions of the user's hand that are not represented in the one or more first images is performed using a deep neural network.

Example 21 includes the subject matter of Example 19 or Example 20 and further comprises receiving a plurality of first images from a plurality of imaging sensors of a wrist borne wearable device, the plurality of first images including image data of one or more portions of a user's hand, but excluding image data of one or more distal phalanxes or of one or more fingertips. Example 21 further includes the acts of stitching at least some of the plurality of first images to create a stitched plurality of first images and determining, based on the stitched plurality of first images, a 3D spatial position of one or more portions of the user's hand that are not represented in the plurality of first images. Example 21 optionally includes the act of determining a pose of the hand based on the determined 3D spatial position.

Example 22 includes the subject matter of any of Examples 19-21, wherein the determining of the pose of the hand is performed, at least in part, using a kinematic model that infers the pose of the hand based, at least in part, on the 3D spatial position of the one or more portions of the user's hand that are not represented in the one or more first images.

Example 23 includes the subject matter of any of Examples 19-22, wherein the plurality of first images comprises image data for only a dorsal aspect of the hand of the user.

Example 24 includes the subject matter of any of Examples 19-23, wherein the plurality of first images comprises image data for a dorsal aspect of the hand of the user and at least a portion of a proximal phalanx of at least one finger (which may be a thumb) of the hand of the user.

Example 25 includes the subject matter of any of Examples 19-24, wherein the plurality of first images comprises image data for a dorsal aspect of the hand of the user, at least a portion of a first phalanx (e.g., a proximal phalanx) of at least one finger (which may be a thumb) of the hand of the user, and at least a portion of a second phalanx (e.g., a middle phalanx or a distal phalanx) of at least one finger (which may be a thumb) of the hand of the user.

Additional illustrative embodiments relating to machine learning based activity detection utilizing reconstructed 3D arm postures will now be described with reference to FIGS. 11 through 15. These embodiments, like the other embodiments disclosed herein, are presented by way of illustrative example only, and should not be construed as limiting in any way.

These embodiments include a system referred to herein as EatingTrak, a wearable technology illustratively implemented using one or more wearable devices. In some embodiments, the one or more wearable devices include a commodity smartwatch, although it is to be appreciated that additional or alternative wearable devices, such as a smartphone and/or an instrumented ring, can be used in other embodiments.

The EatingTrak system in some embodiments can detect fine-grained eating and drinking activities at a per-intake level in a lab environment, as well as in semi-wild and free-living scenarios. EatingTrak reconstructs 3D arm postures using one or more IMUs from the one or more wearable devices (e.g., smartwatch), and then recognizes fine-grained eating activities from the series of estimated 3D arm postures. To validate this example system, an Apple Watch was used to collect data for in-lab, semi-wild, and free-living eating scenarios with 12 participants, where videos were recorded for ground truth. The results showed that EatingTrak was able to identify eating intake events with an average F₁ score of 92.2%, 77.0% and 58.5% in the in-lab, semi-wild, and free-living scenarios, respectively. It also classified eating intake events for 4 types of utensil and 7 types of food with an average accuracy of 83.7% and 64.6%, respectively.

American adults and children are increasingly suffering from obesity, which has been known to be related to a variety of health issues, including stroke, type II diabetes, heart disease, and certain types of cancers.

As indicated previously, the traditional approach for food journaling usually requires the user to manually log the eating activities (e.g., on a smartphone app), which depends on the user's self-motivation and determination, and is therefore known to be unsustainable in practice. To release the users from tedious self-report tasks and get a more accurate estimate on eating activities, a variety of computing technologies have been developed. Most of the existing systems only work well in a controlled environment but, unfortunately, fail to provide satisfactory accuracy (when the user eats) and resolution (what and how the user eats) in distinguishing different eating activities in the wild. Identifying eating activity from other activities in everyday living can be highly challenging without enough contextual information, because human activities usually involve complicated body movements. In order to eat, one needs to move his or her hand down to pick up the food, then move it up towards the mouth, and rotate the wrist to put the food into the mouth. This series of arm movements constructs a unique temporal pattern in 3D body space as the food intake gesture.

Illustrative embodiments herein provide significant improvements over conventional approaches at least in part by utilizing contextual information, such as the 3D arm posture, to make a just-in-time prediction, as the 3D arm posture intuitively distinguishes eating behaviors from others.

Advantageously, the EatingTrak system significantly improves accuracy and resolution in recognizing eating activities without requiring new hardware. The system is configured to recognize eating activities at intake-level by learning with reconstructed 3D arm postures from one or more wearable devices, such as a commodity smartwatch.

In some embodiments, EatingTrak estimates the arm pose in the wild using a commodity smartwatch. The system implements a machine learning pipeline that can identify eating moments, and recognize utensils and food types, from 3D arm posture data. As indicated above, advantageous performance of the system was confirmed through a user study with 12 participants on eating activity recognition in in-lab, semi-wild, and free-living conditions.

FIG. 11 illustrates an information processing system 1100 implementing one or more embodiments of the disclosed techniques.

The system 1100 comprises a weighted dictionary generation subsystem 1102 coupled to an arm pose estimation and activity detection subsystem 1104. The weighted dictionary generation subsystem 1102 generates a weighted dictionary that is utilized in the arm pose estimation and activity detection subsystem 1104 to reconstruct arm poses for use in activity detection, as will be described in more detail below.

The subsystem 1104 of system 1100 in this illustrative embodiment can be generally viewed as implementing a pipeline comprising four distinct processing stages in recognizing eating activities using data obtained from an IMU: wrist orientation detection, body direction estimation, arm pose reconstruction and machine learning task classification.

Wrist-orientation data for the example user study was gathered from an Apple Watch's in-built IMU sensor. This variable is used throughout the pipeline.

The first step in estimating body direction is to iterate through all possible directions. For each possible body direction, wrist orientation is calculated relative to the torso coordinate system using the wrist's orientation relative to the earth coordinate system and the body direction. The resulting wrist orientation relative to torso coordinates is then looked up in the weighted dictionary. The weighted dictionary, which was previously generated using actual arm postures, returns a weighted point cloud. With the point cloud, a probability is assigned to the body direction candidate in question by summing up the point cloud's weights. The body direction with the highest probability is selected.

Once a body direction estimation is determined, the wrist orientation relative to the earth coordinate system can be directly transformed to derive the wrist orientation relative to the torso coordinate system. The derived wrist orientation relative to torso coordinate system values can then be searched in the weighted dictionary to get a weighted point cloud. All points in the point cloud can then be averaged based on their weights to get an arm posture estimation, which is used as the reconstructed arm posture. Such reconstructed arm postures may be viewed as examples of what are more generally referred to herein as “arm pose estimates.”

For final task classification, the example segmentation algorithm extracts possible intake gestures into segments, and resamples each segment to have an at least substantially uniform length. A support vector machine (SVM) model is used to classify whether or not that segment is an intake gesture. For the in-lab study, utensil type and food type are also predicted.

Pre-Pipeline Preparation

Coordinate System

FIG. 12 illustrates an example torso coordinate system when considering the movements of the user's left arm as follows:

origin point: the user's left shoulder

+x axis: from left shoulder to right shoulder

+y axis: facing direction

+z axis: upward direction perpendicular to the x-y plane

For the movement of the right arm, the coordinate system on the left arm is mirrored to simplify data processing.

Previous work has modeled human arms with 7 degrees of freedom (DoF), namely, 3 DoFs for the shoulder, 2 DoFs for the elbow and 2 DoFs for the wrist. However, the DoFs for the wrist cannot be detected with a smartwatch and were excluded from the model of the example user study. The remaining 5 DoFs, denoted as θ₁, θ₂, θ₃, θ₄, θ₅, are demonstrated in FIG. 12.

With the upper arm length (l_(u)), forearm length (l_(f)), and the 5 DoFs, the posture of the entire arm can be uniquely predicted using the Denavit-Hartenberg transformation. Arm posture generated by the transformation can be defined by wrist orientation, wrist location, and elbow location.

Dictionary Generation

In order to recover arm posture from the IMU data of a wrist-mounted smartwatch, a map from wrist orientation to arm posture is established. However, directly solving the inverse kinematics problem of recovering arm posture from wrist orientation can be challenging. Since only a limited number of arm posture positions cause a specific wrist orientation, the correlation between wrist orientations and arm postures is established by iterating through all valid combinations of the arm's 5 DoFs. This process will exhaust all possible arm postures and their corresponding wrist orientations. Thus, a dictionary of arm postures indexed by wrist orientations can be established.

Based on the example coordinate system used in the study, the zero position, θ₁=θ₂=θ₃=θ₄=θ₅=0 is defined as the position where the user's left arm falls freely to the left side of the body, with palm facing towards the body. However, other coordinate systems could be advantageously used in association with the concepts disclosed herein.

Table 2 below summarizes five empirical ranges of motion corresponding to the 5 DoFs of the arm used in some embodiments herein.

TABLE 2
Ranges of Motion Corresponding to the 5 DoFs

DoF    Range
θ₁     [−60°, 180°]
θ₂     [−40°, 120°]
θ₃     [−30°, 120°]
θ₄     [0°, 150°]
θ₅     [−90°, 90°]

Since the 5th DoF does not influence the location of the elbow and the wrist, iteration is only over the first 4 DoFs. Combined with the orientation obtained from the sensor, the first 4 DoFs are adequate to recover the whole arm's posture. By iterating over the first 4 DoFs two degrees at a time, approximately 77 million combinations were gathered. However, in reality, the arm's DoFs are coupled with one another causing many combinations to be invalid. Nonetheless, not all valid combinations can be obtained by iterating over these ranges of motion.

To cover as many valid arm postures as possible while also eliminating invalid arm poses, the ranges of motion of the 5 DoFs for the elbow and wrist movement were enlarged, which consequently expanded the number of combinations from 77 million to 228 million. Next, the method disclosed in I. Akhter et al., “Pose-Conditioned Joint Angle Limits for 3D Human Pose Reconstruction,” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) 2015, which is incorporated by reference herein in its entirety, was used to check whether the pose was valid. It uses a database of professional athletes' joint limits to verify each combination and eliminate impossible positions. 77% of combinations can be removed by applying this validation check function, resulting in a final dictionary size of approximately 51 million.

In some examples, the dictionary is indexed by wrist orientation, storing the 4 DoFs that can generate that particular orientation. Given a specific wrist orientation Rot_(w), a set of possible arm postures described as DoFs can be derived from the dictionary. The first 4 DoFs can be obtained directly from the dictionary, and θ₅ can be calculated using θ₁, θ₂, θ₃, θ₄ and Rot_(w). A point cloud of elbow and wrist location points can then be calculated from DoFs.

Weighted Dictionary

A simple averaging algorithm assumes that all points in the point cloud have an equal possibility of appearance. However, this method does not best represent the actual distribution of the points in the dictionary.

Different activities can elicit a variety of distinct distributions. Provided herein is a novel algorithm to estimate the probability distribution in the point cloud. Arm postures are collected during a specific activity, and the probability distribution in the point cloud is updated at each wrist orientation. An independent IMU sensor was used to collect the movement of the upper arm. With orientations derived from sensors located in upper arm and lower arm, the entire arm posture can be uniquely determined.

Let the observed arm posture at a given moment be denoted as $(\mathrm{Rot}_{e_g}, \mathrm{Rot}_{w_g})$. By looking $\mathrm{Rot}_{w_g}$ up in the dictionary, the DoF cloud $D=\{\mathrm{DoFs}_1, \mathrm{DoFs}_2, \ldots, \mathrm{DoFs}_n\}$ indicating all possible arm postures can be obtained. The corresponding elbow orientations $\{\mathrm{Rot}_{e_1}, \mathrm{Rot}_{e_2}, \ldots, \mathrm{Rot}_{e_n}\}$ can be calculated from the DoFs. Weights are then assigned to the corresponding DoFs in the dictionary based on the distance between $\mathrm{Rot}_{e_g}$ and $\mathrm{Rot}_{e_i}$:

$w(\mathrm{DoFs}_i)=f\left(\mathrm{dis}\left(\mathrm{Rot}_{e_g}, \mathrm{Rot}_{e_i}\right)\right)$

where $f$ is the Gaussian distribution function and dis is determined by the angle between the axes of the two rotation matrices. Since the number of observed arm postures is very small compared to the size of the dictionary, neighbors of $\mathrm{Rot}_{w_g}$ can advantageously be included when querying the point cloud from the dictionary to better utilize the limited number of observations.
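By way of illustration only, the weighting described above can be sketched as follows. The sketch assumes that the distance between two rotation matrices is taken as the geodesic rotation angle between them, that the Gaussian is unnormalized, and that weights accumulate over successive observations. The function names, the 15-degree standard deviation and the index-keyed weight map are hypothetical and are not taken from the embodiments described herein.

```python
import math

def rotation_angle_between(r_a, r_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices (assumed distance)."""
    # trace(r_a^T r_b) equals the sum of element-wise products of the two matrices.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.acos(cos_angle)

def gaussian_weight(distance, sigma):
    """Unnormalized Gaussian kernel; sigma is a hypothetical tuning parameter."""
    return math.exp(-0.5 * (distance / sigma) ** 2)

def update_weights(weights, candidate_elbow_rots, observed_elbow_rot,
                   sigma=math.radians(15.0)):
    """Accumulate a weight for each candidate posture returned by the dictionary.

    weights maps a candidate index to its accumulated weight.
    """
    for idx, rot in enumerate(candidate_elbow_rots):
        d = rotation_angle_between(observed_elbow_rot, rot)
        weights[idx] = weights.get(idx, 0.0) + gaussian_weight(d, sigma)
    return weights
```

Candidates whose elbow orientation is close to the observed one receive a weight near 1, while distant candidates receive a weight near 0.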

Arm postures were collected during eating periods of an individual in the lab for approximately 6 hours, and a weighted dictionary was established using those observations. That weighted dictionary was then used to perform arm posture reconstruction and body direction estimation on other participants during the user study.

Wrist Orientation Detection

The orientation obtained from the smartwatch is measured against the Earth Coordinate System (ECS). Assuming that the user's facing direction is constant during the eating period, ECS has a constant transformation to the Torso Coordinate System (TCS). The Wrist Coordinate System (WCS) and Sensor Coordinate System (SCS) are respectively defined as the coordinate system of the wrist and the sensor mounted on the wrist. These two coordinate systems change over time as the user moves his or her wrist. The orientation of WCS has the same configuration as that of TCS at zero position. The SCS is defined by the specific hardware setup of the sensor.

The base vectors of ECS and TCS are denoted as $(\vec{E}_x, \vec{E}_y, \vec{E}_z)$ and $(\vec{T}_x, \vec{T}_y, \vec{T}_z)$ respectively, and the base vectors of the WCS and SCS at timestamp t are denoted as $(\vec{w}_x^{\,t}, \vec{w}_y^{\,t}, \vec{w}_z^{\,t})$ and $(\vec{s}_x^{\,t}, \vec{s}_y^{\,t}, \vec{s}_z^{\,t})$ respectively.

Although the sensor was mounted on the wrist in the user study, the coordinate system setup of SCS and WCS may not be exactly aligned because users may wear their watches in different styles. The transformation between SCS and WCS can be described by matrix P.

$(\vec{w}_x^{\,t}, \vec{w}_y^{\,t}, \vec{w}_z^{\,t}) = (\vec{s}_x^{\,t}, \vec{s}_y^{\,t}, \vec{s}_z^{\,t})\,P$

When the wrist with the mounted sensor moves, the sensor orientation at moment t relative to ECS can be described as rotation matrix R_(t).

$(\vec{s}_x^{\,t}, \vec{s}_y^{\,t}, \vec{s}_z^{\,t}) = (\vec{E}_x, \vec{E}_y, \vec{E}_z)\,R_t$

R_(t) can be derived from the Euler angle or quaternions from the sensor. Assuming that the direction the user is facing remains constant during the eating period, TCS and WCS are identical at the zero position.

$(\vec{w}_x^{\,0}, \vec{w}_y^{\,0}, \vec{w}_z^{\,0}) = (\vec{s}_x^{\,0}, \vec{s}_y^{\,0}, \vec{s}_z^{\,0})\,P = (\vec{E}_x, \vec{E}_y, \vec{E}_z)\,R_0 P = (\vec{T}_x, \vec{T}_y, \vec{T}_z)$

Therefore, at a given moment in time t, the orientation of wrist relative to the torso can be described as

$(\vec{w}_x^{\,t}, \vec{w}_y^{\,t}, \vec{w}_z^{\,t}) = (\vec{s}_x^{\,t}, \vec{s}_y^{\,t}, \vec{s}_z^{\,t})\,P = (\vec{T}_x, \vec{T}_y, \vec{T}_z)\,P^{-1} R_0^{-1} R_t P$

So the orientation of the wrist can be denoted as

$\mathrm{Rot}_w = P^{-1} R_0^{-1} R_t P$

A calibration process is performed in order to measure R₀. In some examples, the user is asked to stand or sit still, with his or her arm at the zero position for a few seconds. The matrix P can be determined based on the way the user wears the watch.
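As a minimal sketch of the wrist orientation computation described above, the following evaluates the product P⁻¹R₀⁻¹R_tP for 3x3 matrices. It assumes that P, R₀ and R_t are all proper rotation matrices, so that each inverse equals the transpose. The helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    """Transpose of a 3x3 matrix; equals the inverse for a rotation matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def wrist_orientation(r_t, r_0, p):
    """Wrist orientation relative to the torso: P^-1 R_0^-1 R_t P.

    r_t: sensor orientation relative to the earth frame at time t
    r_0: sensor orientation recorded during calibration at the zero position
    p:   fixed sensor-to-wrist alignment (assumed to be a rotation matrix)
    """
    return matmul(matmul(matmul(transpose(p), transpose(r_0)), r_t), p)
```

When the current reading equals the calibration reading, the result is the identity matrix, which matches the definition of the zero position.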

After deriving $\mathrm{Rot}_w$, a set of all possible DoFs $D=\{\mathrm{DoFs}_1, \mathrm{DoFs}_2, \ldots, \mathrm{DoFs}_n\}$ can be obtained by looking up $\mathrm{Rot}_w$ in the dictionary. A point cloud $C=\{(\mathrm{loc}_{e_1}, \mathrm{loc}_{w_1}), (\mathrm{loc}_{e_2}, \mathrm{loc}_{w_2}), \ldots, (\mathrm{loc}_{e_n}, \mathrm{loc}_{w_n})\}$ can then be calculated from the corresponding DoFs, where $\mathrm{loc}_{e_k}$ and $\mathrm{loc}_{w_k}$, for $k=1, 2, \ldots, n$, denote the coordinates in the TCS of the elbow and the wrist, respectively, of the k-th point in the point cloud.

Body Direction Estimation

The reconstruction algorithm mentioned above assumes that the facing direction of the user is constant during data acquisition. However, this assumption restricts the application of the reconstruction algorithms in a real-life environment in which the facing direction constantly changes. Another algorithm opportunistically uses arm poses that can reveal the user's facing direction to correct the body direction. However, this estimation largely depends on the user's activity, and can have significant errors when the user is sitting down and eating.

Therefore, a novel body direction estimation algorithm is advantageously provided to estimate the facing direction(s) during a certain activity from sensor readings collected from the wearable device, such as a smartwatch. It takes the orientation of the smartwatch as input, and gives the estimated body direction as output. For a given wrist orientation $\mathrm{Rot}_{w0}$ at a certain moment, all possible body directions $d_i$ from 0 to 360 degrees, relative to the initial facing direction, are iterated. The probability of the event that the wrist orientation $\mathrm{Rot}_w=\mathrm{Rot}_{w0}$ is caused by facing direction $d=d_i$, conditioned on the user doing activity $\hat{A}$, is then estimated. The probability is estimated by adding up the weights in the weighted point cloud $C_i$ obtained by querying $\mathrm{Rot}_{w_i}=\mathrm{rotate}(\mathrm{Rot}_{w0}, d_i)$ in the weighted dictionary optimized for activity $\hat{A}$, and normalizing by the total weight over all candidate directions:

$P\left(\mathrm{Rot}_w=\mathrm{Rot}_{w0} \mid A=\hat{A},\ d=d_i\right)=\frac{\sum_{p_j\in C_i} w(p_j)}{\sum_{k}\sum_{p_j\in C_k} w(p_j)}$
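The candidate search described above can be sketched as follows, purely as an illustration. The callables `rotate` and `lookup_weights` are hypothetical placeholders for, respectively, the transformation of the wrist orientation by a candidate direction and the weighted dictionary query. The 5-degree candidate spacing is likewise an assumption.

```python
def estimate_body_direction(rot_w0, lookup_weights, rotate, step_deg=5):
    """Score each candidate facing direction and return the most probable one.

    rot_w0:         wrist orientation at the moment of interest
    lookup_weights: hypothetical callable returning the weights of the point
                    cloud stored for a given wrist orientation
    rotate:         hypothetical callable applying a candidate direction
    Returns (best_direction_deg, {direction_deg: probability}).
    """
    candidates = list(range(0, 360, step_deg))
    # Numerator of the probability: summed weights of the cloud for each candidate.
    scores = [sum(lookup_weights(rotate(rot_w0, d))) for d in candidates]
    total = sum(scores)
    if total == 0:
        # No dictionary entry matched any candidate; no estimate is available.
        return None, {d: 0.0 for d in candidates}
    probs = {d: s / total for d, s in zip(candidates, scores)}
    best = max(probs, key=probs.get)
    return best, probs
```

The returned probabilities sum to one over the candidates, so they can be passed directly to a subsequent smoothing stage.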

The initial estimation can be very noisy. A filter can then be applied to reduce the noise. For example, a Kalman filter can be applied to utilize the time correlation of the data series. Though the weighted dictionary utilized in the user study was optimized for in-lab eating activity, the body direction estimation algorithm performs well in a free-living setting as well, as illustrated in FIG. 13. The data shown in the figure was collected during and after a meal by an individual. One of the plots indicates the estimated facing direction after application of the Kalman filter, and the other shows the observed facing direction calculated from a smartphone worn around the user's wrist.
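One possible form of such a filter is sketched below as a scalar Kalman filter operating on the per-frame direction estimates. The random-walk motion model, the variance values and the function names are assumptions made for illustration. The text above states only that a Kalman filter can be applied.

```python
def wrap_deg(angle):
    """Wrap an angle difference to the interval [-180, 180) degrees."""
    return (angle + 180.0) % 360.0 - 180.0

def kalman_smooth_direction(measurements_deg, process_var=1.0, measurement_var=100.0):
    """Smooth noisy facing-direction estimates (degrees) with a scalar Kalman filter.

    Assumes the facing direction follows a random walk between frames.
    """
    if not measurements_deg:
        return []
    estimate = measurements_deg[0] % 360.0
    variance = measurement_var
    smoothed = [estimate]
    for z in measurements_deg[1:]:
        variance += process_var                        # predict step
        gain = variance / (variance + measurement_var)
        innovation = wrap_deg(z - estimate)            # shortest angular difference
        estimate = (estimate + gain * innovation) % 360.0
        variance *= (1.0 - gain)                       # update step
        smoothed.append(estimate)
    return smoothed
```

Wrapping the innovation keeps the filter from averaging through 180 degrees when consecutive estimates fall on either side of the 0/360 boundary.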

Arm Posture Reconstruction

After obtaining the point cloud from the wrist orientation, all points in the point cloud are averaged to generate the reconstructed arm posture. For a point cloud derived from a weighted dictionary, the points are averaged according to their corresponding weights.
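A minimal sketch of this averaging step is given below. It assumes that each point of the cloud is stored as a pair of 3D coordinates (elbow, wrist) in the TCS and that uniform weights are used when no weighted dictionary is available. The function name is hypothetical.

```python
def weighted_mean_posture(cloud, weights=None):
    """Average a point cloud of (elbow_xyz, wrist_xyz) pairs.

    cloud:   list of ((ex, ey, ez), (wx, wy, wz)) tuples
    weights: optional per-point weights; uniform if omitted
    Returns (elbow_xyz, wrist_xyz) as two lists of three floats.
    """
    if not cloud:
        raise ValueError("point cloud is empty")
    if weights is None:
        weights = [1.0] * len(cloud)
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    elbow = [sum(w * p[0][k] for p, w in zip(cloud, weights)) / total for k in range(3)]
    wrist = [sum(w * p[1][k] for p, w in zip(cloud, weights)) / total for k in range(3)]
    return elbow, wrist
```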

Machine Learning Task Classification

Intake Segmentation

Given the reconstructed arm pose, all segments that may possibly be an eating intake are first segmented out based on the distance from the hand to the shoulder point on the other side of the body. As shown in FIG. 14, each intake initially appears as a valley in the distance plot. The distance is smoothed using a window of 5 s centered at the reference point, and the smoothed distance is then subtracted from the distance. Valleys below the horizontal axis are segmented out as valley segments. Valley segments that are longer than 0.4 s and have a distance change greater than 0.1 forearm length are segmented out as potential intakes. Note that the segmented potential intakes do not overlap each other.
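The segmentation steps above can be sketched as follows. This is an illustrative sketch only: it assumes a uniformly sampled distance signal, a centered moving average that is truncated at the signal boundaries, and a "distance change" measured as the difference between the largest and smallest raw distance within a valley segment. The function and parameter names are hypothetical.

```python
def segment_potential_intakes(distance, fs, forearm_len,
                              window_s=5.0, min_dur_s=0.4, min_change=0.1):
    """Return (start, end) sample index pairs of potential intake segments.

    distance:    hand-to-opposite-shoulder distance per sample
    fs:          sampling rate in Hz
    forearm_len: forearm length in the same unit as distance
    """
    n = len(distance)
    half = int(window_s * fs / 2)
    # Centered moving average over a 5 s window (truncated at the ends).
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(distance[lo:hi]) / (hi - lo))
    residual = [d - s for d, s in zip(distance, smoothed)]

    segments, start = [], None
    for i, r in enumerate(residual + [0.0]):   # sentinel closes a trailing valley
        if r < 0 and start is None:
            start = i                          # valley begins below the axis
        elif r >= 0 and start is not None:
            seg = distance[start:i]
            long_enough = (i - start) / fs > min_dur_s
            deep_enough = (max(seg) - min(seg)) > min_change * forearm_len
            if long_enough and deep_enough:
                segments.append((start, i))    # end index is exclusive
            start = None
    return segments
```

Because each valley is closed before the next one can open, the returned segments do not overlap.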

Eating Behavior Detection

After segmenting out potential intakes, the location of elbow and wrist, the palm's facing vector, distance between wrist and the shoulder point of the other arm, and the rotation angle of the wrist from the reconstruction are selected as features to be put into an SVM model, referred to herein as SVM-eating, for eating behavior detection. The detailed features fed into the SVM model are shown in Table 3 below. In some examples, each segment is resampled to 30 frames to provide uniform length for the SVM model.

TABLE 3
Features from the Reconstruction

Feature                                                   Symbol
Coordinate of elbow relative to TCS                       $(\mathrm{loc}_{e_x}, \mathrm{loc}_{e_y}, \mathrm{loc}_{e_z})$
Coordinate of wrist relative to TCS                       $(\mathrm{loc}_{w_x}, \mathrm{loc}_{w_y}, \mathrm{loc}_{w_z})$
Coordinate of palm's facing vector relative to TCS        $(f_x, f_y, f_z)$
Distance between wrist and shoulder point of other arm    $d$
Rotation angle of wrist                                   $\theta$
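The resampling of a segment to a uniform length can be sketched as follows. The sketch assumes linear interpolation along the time axis and a per-frame feature vector of 11 values (the 3 + 3 + 3 + 1 + 1 entries of Table 3). The function name and the choice of interpolation are assumptions, as the text above specifies only the target length of 30 frames.

```python
def resample_segment(frames, target_len=30):
    """Linearly resample a segment to target_len frames (target_len >= 2).

    frames: list of equal-length feature vectors, one per original frame
    Returns a list of target_len feature vectors.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("segment is empty")
    if n == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)   # fractional index into frames
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out
```

Under these assumptions, flattening the 30 resampled frames yields a fixed-length vector of 330 values per segment that could serve as the input to a classifier.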

The SVM-eating model is trained to use the features above as input to classify whether the segment is an eating intake or not. In the training process, if a ground-truth intake moment is in the range of the segment, the training label of that segment is assigned as eating, otherwise the training label is assigned as not eating. In the evaluation process, the model gives the prediction of segmented potential intakes as eating or not eating.
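The labeling rule for training can be sketched as follows, assuming that segments and ground-truth intake moments are both expressed in seconds on a common clock and that the segment boundaries are inclusive. The function name is hypothetical.

```python
def label_segments(segments, intake_times):
    """Assign a training label to each potential intake segment.

    segments:     list of (start_s, end_s) tuples
    intake_times: ground-truth intake moments in seconds
    Returns 1 (eating) if any intake moment falls within the segment, else 0.
    """
    return [int(any(start <= t <= end for t in intake_times))
            for start, end in segments]
```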

Utensil/Food Classification

For utensil classification or food classification, new SVM models, SVM-utensil or SVM-food, are trained to classify the detected eating intake segment into utensil or food categories using the same features above as input. In the training process, the model only uses the segments which contain ground-truth intakes as training data. In the evaluation process, the model gives the prediction of utensil or food type for the segments which are predicted as eating by the SVM-eating model.

Additional aspects of the user study conducted to evaluate the performance of the EatingTrak system will now be described.

The user study used to evaluate the example system involved an in-lab scenario, a semi-wild scenario and a free-living scenario, and was approved by the Institutional Research Board. With 12 participants (9 male), three sessions were held, one for each of the study conditions. Throughout the study, each of the participants wore an Apple Watch Series 4 on his or her dominant hand. The sensor data was recorded using an application, Sensor Log. Due to hardware failure and misconfiguration, complete data was not recorded for 3 sessions. Therefore, the free-living scenario study was re-conducted for one participant and the in-lab study was re-conducted for two participants.

Experimental Settings

In-Lab Scenario

In the in-lab scenario, each participant was instructed to eat and drink the food types illustrated in Table 4 using the corresponding utensils. The food types were selected based on the five food groups mentioned by the U.S. Department of Agriculture. Thus, the selected food types meet the necessary nutrition requirement that creates a balanced diet. To satisfy participants' dietary restrictions, vegetarian options were included.

TABLE 4
Food Type for In-Lab Setting

| Utensil | Spoon | Fork | Hand | Cup |
| --- | --- | --- | --- | --- |
| Food | Cereal | Salad | Raisins | Water |
| | Yogurt | Apples | Chips | |

The participants wore the Apple Watch devices on their respective dominant hands and were instructed to eat only with the arm wearing the watch. This scenario was divided into 6 identical and independent sessions: the in-lab study was conducted two times per participant, with each time comprising 3 eating sessions. In each session, the participants were instructed to take 10 intakes of each food type. Although they were free to choose the order of the foods, they were instructed to complete all 10 intakes of one food type before starting to consume another. The ground truth was collected by video recording the entire in-lab user study of every participant. The exact time of each intake moment during the study was also recorded.

Semi-Wild Scenario

The semi-wild scenario required the participants to wear an Apple Watch on their dominant hand to collect wrist sensor data, together with a chest-mounted GoPro device that captured the participants' intake behaviors. The chest-mounted GoPro device was selected to allow the participants to conduct activities as naturally as possible. They were asked to conduct the following activities: walk up the stairs in the building, use a cell phone for 1 to 2 minutes, type using a laptop, have a conversation with the study conductor and clean up the food from the in-lab portion of the user study. These non-eating activities represent normal daily routines. During the study, the participants were told that they were free to eat some chips or not. Only one participant did not eat any snacks while performing the activities. The 12 participants completed the semi-wild session in an average of 17.42 minutes. The ground-truth intake moments were manually labeled from video recordings of the GoPro device.

Free-Living Scenario

To detect when the participants ate during a totally unrestricted free-living scenario, a free-living session was conducted with the same participants from the in-lab and semi-wild studies. They were asked to wear a wrist-mounted Apple Watch on their dominant hand to collect the sensor data of the wrist, along with a chest-mounted GoPro device pointing towards their mouths. After they put on the devices, they were asked to have a meal in their normal daily environment without any limitations on food type or behavior. They were given a maximum of 1 hour (limited by the GoPro battery) to complete a meal. They were not required to log any information about their eating behaviors throughout this study. The 12 participants completed the session in an average of 27.08 minutes. The ground-truth intake moments were manually labeled from video recordings of the GoPro device.

Results

The results for the three study scenarios will now be described. Note that all the results presented here are at a per-intake level. For eating detection, a true positive occurs when the example model predicts that a segment is eating and a ground-truth intake action is within the range of the segment. A false positive occurs if the example model predicts that a segment is eating but no ground-truth intake action exists within the segment. A false negative occurs when an intake action exists but none of the segments that the example model predicts as eating contains it. For utensil and food classifications, only the segments that the example model predicts as eating are considered. Within these segments, the accuracy of the utensil and food classifications is defined as follows:

$\text{Accuracy} = \frac{\#\,\text{Segments Correctly Classified}}{\#\,\text{Segments Correctly Predicted as Eating}}$
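The per-intake definitions of true positives, false positives and false negatives given above can be sketched in code as follows. The function names and the representation of segments as `(start, end)` pairs in seconds are assumptions of this sketch, and the sketch further assumes that each predicted segment contains at most one ground-truth intake, so that segment-level and intake-level counts are comparable.

```python
from typing import Sequence, Tuple


def per_intake_counts(predicted_eating: Sequence[Tuple[float, float]],
                      intake_times: Sequence[float]) -> Tuple[int, int, int]:
    """Count (TP, FP, FN) from segments predicted as eating and
    ground-truth intake moments."""
    # TP: a predicted-eating segment that contains a ground-truth intake.
    tp = sum(1 for (s, e) in predicted_eating
             if any(s <= t <= e for t in intake_times))
    # FP: a predicted-eating segment with no ground-truth intake inside.
    fp = len(predicted_eating) - tp
    # FN: a ground-truth intake not covered by any predicted-eating segment.
    fn = sum(1 for t in intake_times
             if not any(s <= t <= e for (s, e) in predicted_eating))
    return tp, fp, fn


def f1_from(precision: float, recall: float) -> float:
    """F1 score as the harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy data: three predicted-eating segments, three ground-truth intakes.
predicted = [(0.0, 2.0), (4.0, 6.0), (8.0, 10.0)]
intakes = [1.0, 9.0, 13.0]
tp, fp, fn = per_intake_counts(predicted, intakes)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn)  # 2 1 1
print(round(f1_from(precision, recall), 3))  # 0.667
```

As a consistency check, `f1_from(0.939, 0.905)` evaluates to approximately 0.922, which agrees with the in-lab F₁ score, precision and recall reported below.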

In-Lab Scenario

In the in-lab scenario, each participant has 10 intakes per food type for each of the 6 sessions. For each participant, user-dependent SVM models were trained for eating behavior detection, utensil type classification and food type classifications, respectively. The results are based on cross validation among the 6 sessions: to evaluate on each session, the model was trained on the other five sessions. Across all 12 participants, for eating behavior detection the overall F₁ score is 92.2% with precision 93.9% and recall 90.5%. The accuracy for utensil type classification is 83.7%. Food type classification returned an average accuracy of 64.6%.
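The session-wise cross validation described above, in which each session is evaluated using a model trained on the other five, can be sketched as follows. The function name `leave_one_out_folds` and the integer session identifiers are assumptions of this sketch; the same splitting logic would apply to participant-wise cross validation if participant identifiers were supplied instead.

```python
from typing import List, Sequence, Tuple


def leave_one_out_folds(ids: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Return (training ids, held-out id) pairs, one per id."""
    folds = []
    for held_out in ids:
        train = [i for i in ids if i != held_out]
        folds.append((train, held_out))
    return folds


folds = leave_one_out_folds([1, 2, 3, 4, 5, 6])
for train, test in folds:
    print("train on sessions", train, "-> evaluate on session", test)
```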

The cup among utensils and water among food types each attain the highest accuracy, at 95%. This result may be caused by the difference in wrist rotation between drinking and eating behaviors: during drinking, the wrist rotates outwards, while for all eating behaviors the wrist rotates inwards. According to the utensil-type confusion matrix, the model achieved accuracy greater than or equal to 80% for every utensil type. The accuracy for food type is lower than that for utensil type, which can be attributed to the similar arm postures adopted when using the same utensil. The food-type confusion matrix shows that the confusion is concentrated between foods eaten with the same utensil, e.g., cereal/yogurt, salad/apples and raisins/chips.

Semi-Wild Scenario

In the semi-wild scenario, the average number of intakes among the 12 participants is 23.6. User-independent SVM models were trained for eating behavior detection and the reported results are based on cross validation among participants: to evaluate on each participant, the model was trained on the other participants' data. The overall F₁ score is 77.0% with precision 78.9% and recall 75.3%. P12 has the highest precision, recall and F₁ score while P4 has the lowest. The low results for P4 could be due to eating behavior that differed from that of the other participants, as observed in the ground-truth video. For example, P4 held the chips near the mouth for a long time and ate them slowly, while the other participants ate the chips faster. If the models are trained and the results calculated without P4, the overall F₁ score, precision and recall across the remaining 11 participants are 83.8%, 87.1% and 80.8%, respectively.

Free-Living Scenario

In the free-living scenario, the average number of intakes among the 12 participants is 50.9. User-dependent SVM models were trained for eating behavior detection and the results reported are based on cross validation among participants: to evaluate on each participant, the model was trained on the data collected in that participant's in-lab and semi-wild scenarios as well as in the other participants' free-living scenarios. The overall F₁ score was 58.5% with precision 69.0% and recall 50.7%. The variance among participants is very large. P12 has the highest precision, recall and F₁ score while P11 has the lowest. This variance could be caused by the absence of any constraints in the free-living scenario, so that utensil types and eating behaviors vary per participant. Participants who used utensils that were included in the in-lab setting, e.g., P12, who used a spoon, achieved relatively higher results than other participants. In contrast, participant P11 exhibited eating behavior distinct from that of the other participants, such as raising the hand near the mouth and keeping the arm static while eating a beef stick. The system, however, works under the assumption that the arm is not stationary whenever an individual is having a meal. Therefore, P11 achieved the lowest precision, recall and F₁ score.

User Dependency

For the in-lab study, user-dependent models were trained for different participants. A user-independent model was also trained. Across all 12 participants, for eating behavior detection the overall F₁ score of the user-independent model is 91.8% with precision 93.2% and recall 90.5%, which are only 0.4, 0.7 and 0.0 percentage points lower than those of the user-dependent method. The overall accuracy of utensil type classification and food type classification is 74.7% and 52.9%, respectively, which are 9.0 and 11.7 percentage points lower than those of the user-dependent method.

For the semi-wild study, user-independent models were trained for the participants. A user-dependent model was also trained by adding the data collected in each participant's in-lab scenario to the training set. The overall F₁ score is 74.2% with precision 71.3% and recall 77.4%, which is not higher than the user-independent models.

Therefore, for eating or non-eating actions, the difference among participants is not very large and it is possible to train models in a user-independent way. However, for different utensil type and food type, each participant may have different styles of eating; therefore, training user-dependent models can deliver better performance.

Time Resolution

Most previous work recognizes eating activities with a time resolution of minutes in free-living scenarios. EatingTrak recognizes eating activities at the intake level, which corresponds to a higher time resolution on the order of seconds. Therefore, the performance of EatingTrak can be readily improved by lowering the time resolution. For instance, instead of identifying and classifying individual intake events, the system can be configured to make a collective decision using the intake-level recognition results over a larger time window (e.g., 30 seconds). The longer the window, the greater the confidence in the result.
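One possible way to make such a collective decision over a larger time window is sketched below. The function name `window_decisions`, the use of non-overlapping windows and the `min_intakes` voting threshold are all assumptions of this sketch; only the 30-second window length is taken from the example above.

```python
import math
from typing import List, Sequence


def window_decisions(intake_times: Sequence[float],
                     total_duration: float,
                     window: float = 30.0,
                     min_intakes: int = 1) -> List[bool]:
    """Flag each non-overlapping window as eating if it contains at least
    `min_intakes` detected intake events (hypothetical voting rule)."""
    n_windows = int(math.ceil(total_duration / window))
    counts = [0] * n_windows
    for t in intake_times:
        if 0.0 <= t < total_duration:
            counts[int(t // window)] += 1
    return [c >= min_intakes for c in counts]


# Detected intake times (seconds) over a 2-minute recording.
detected = [3.0, 12.5, 20.1, 47.0, 95.0]
decisions = window_decisions(detected, total_duration=120.0,
                             window=30.0, min_intakes=2)
print(decisions)  # [True, False, False, False]
```

Requiring more than one detected intake per window trades time resolution for robustness: an isolated false detection, such as the single events in the second and fourth windows above, no longer produces an eating decision.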

Further Improvements on Pose Reconstruction

The EatingTrak system relies on the reconstructed arm posture to detect eating behaviors, so the reconstruction accuracy can significantly impact the performance of the system. The arm posture reconstruction algorithm could potentially be improved by adopting the following approaches. First, the reconstruction depends on the facing direction estimation module. The current facing direction estimation algorithm does not take into account the correlation of facing direction between frames or the constraints of human movement. Second, the current reconstruction algorithm assumes that the torso is mostly upright. Information about the inclination of the torso could improve the accuracy of the reconstruction. Third, the reconstruction could also use more time-series information. A time series data processing model (e.g., a Hidden Markov Model) can be adopted to further optimize the classification results.

Applying EatingTrak on Other Daily Activities

The EatingTrak system in some embodiments is configured to reconstruct arm posture from a single wrist-mounted smartwatch. From the reconstructed arm pose, the system detects eating behavior in semi-controlled and free-living scenarios at a per-intake level. Besides eating activities, the disclosed techniques can be applied to detecting other fine-grained daily activities based on the reconstructed arm postures. Such applications are promising for the following reasons. First, illustrative embodiments can be implemented using a single off-the-shelf commodity device as the hardware setup. Second, the body direction estimation module enables the example system to be applied to various free-living scenarios. Third, the averaging reconstruction algorithm of point clouds and weighted point clouds can be implemented in real time. Fourth, the system can be readily adapted to other activities by optimizing the weighted dictionary for the activity of interest. In this manner, by reconstructing arm posture, the disclosed system can also detect other fine-grained daily activities using a commodity smartwatch.

The EatingTrak system as described above provides a wearable technology using a commodity smartwatch to recognize fine-grained eating and drinking activities at a per-intake level in lab, semi-wild, and free-living conditions. A user study was conducted with 12 participants, which showed that the example system was able to identify eating intake events with an average F₁ score of 92.2%, 77.0% and 58.5% in the in-lab, semi-wild and free-living scenarios, respectively. It also classified eating intake events for 4 types of utensil and 7 types of food with an average accuracy of 83.7% and 64.6%, respectively.

It is to be appreciated that the various details associated with the example implementations of the EatingTrak system as described above are presented for purposes of illustration only, and should not be construed as limiting in any way. Other types and arrangements of systems, devices and components can implement machine learning based activity detection utilizing reconstructed 3D arm postures as disclosed herein.

FIG. 15 shows an information processing system 1500 implementing machine learning based activity detection utilizing an IMU in an illustrative embodiment. The system 1500 comprises a processing platform 1502. Coupled to the processing platform 1502 are an IMU 1505 and one or more controlled system components 1506. The IMU 1505 is illustratively a wrist-mounted IMU, although other arrangements of one or more IMUs and additional or alternative data sources can be used in other embodiments.

In some embodiments, the IMU 1505 is part of a wearable device of the user, such as a smartwatch, a wrist-mounted fitness device, or a finger-mounted device (e.g., a ring). Although only a single IMU is shown in this embodiment, other embodiments can utilize multiple IMUs and/or additional or alternative sensors associated with one or more user devices.

In some embodiments, the processing platform 1502 and/or the controlled system components 1506 are implemented at least in part within a user device, such as a smartphone, smartwatch or wearable device of a user. For example, in some embodiments, the processing platform 1502 comprises a smartphone that is in communication with a smartwatch or other wearable device of the user, with the wearable device comprising the IMU 1505.

Additionally or alternatively, the IMU 1505 and the controlled system components 1506 may represent the same processing device, or different components of that same processing device, such as a smartphone, smartwatch, fitness device, wearable device, or other user device of a particular system user.

A wide variety of other processing device implementations are possible for the processing platform 1502, IMU 1505, controlled system components 1506 and other components of the system 1500 in other embodiments.

Accordingly, the various components of the system 1500 as illustrated in FIG. 15 can be part of a single processing device, such as a wearable device or other user device, or distributed across multiple processing devices, such as a smartphone and a wearable device, or other combinations of multiple user devices, computers and/or other processing devices. Also, additional or alternative system components can be used in place of those shown.

The processing platform 1502 as illustrated comprises a machine learning system 1510 and at least one component controller 1512. The machine learning system 1510 in the present embodiment more particularly implements one or more machine learning algorithms, such as machine learning based activity detection algorithms of the type described elsewhere herein, although other arrangements are possible.

In operation, the processing platform 1502 is illustratively configured to obtain data from the IMU 1505, to generate 3D arm pose estimates from the obtained data, and to apply the generated 3D arm pose estimates to machine learning system 1510. The machine learning system 1510 is illustratively trained to recognize temporal-spatial patterns of one or more designated activities, such as eating activities and/or other types of activities to be detected in a particular activity detection context. As indicated previously, arm poses are also referred to herein as respective “arm postures.”

The processing platform 1502 is further configured to obtain at least one classification output from the machine learning system 1510, and to generate at least one control signal based at least in part on the classification output. The control signal is illustratively generated by the component controller 1512.

In some embodiments, the control signal is configured to guide a user toward one or more target activities, possibly by providing such user guidance via a smartphone, wearable device, smart home assistant or other user device. The one or more target activities illustratively comprise one or more activities associated with a particular desired health condition of the user, such as a desired eating activity pattern.

In some embodiments, the machine learning system 1510 is trained on one or more sets of 3D arm pose estimate training data using supervised learning. Other types of training data and learning techniques can be used in other embodiments.

A given one of the 3D arm pose estimates illustratively comprises a combination of an estimate of a wrist orientation relative to a body and an estimate of an elbow orientation relative to the body, wherein the elbow orientation relative to the body is determined at least in part based on the wrist orientation relative to the body.

In some embodiments, the wrist orientation relative to the body is determined based at least in part on an estimate of a direction of the body and an estimate of a wrist orientation relative to a specified frame of reference.

It is to be appreciated that other types and configurations of 3D arm pose estimates can be used in other embodiments.

Additionally or alternatively, generating 3D arm pose estimates illustratively comprises generating a multivariate time series of 3D arm pose estimates comprising a plurality of data segments associated with respective frames of a specified sliding time window. These and other types of 3D arm pose estimates are illustratively generated in the processing platform 1502 and supplied as inputs to the machine learning system 1510.
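A sliding time window over a time series of per-frame pose estimates can be sketched as follows. The function name `sliding_windows` and the window length and stride values are assumptions of this sketch; each frame would in practice hold a multivariate pose estimate rather than the integer placeholders used here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(frames: Sequence[T], window_len: int,
                    stride: int) -> List[Sequence[T]]:
    """Return consecutive windows of `window_len` frames, advancing by
    `stride` frames; trailing frames that do not fill a window are dropped."""
    last_start = len(frames) - window_len
    return [frames[i:i + window_len] for i in range(0, last_start + 1, stride)]


frames = list(range(10))  # placeholders for 10 per-frame pose estimates
windows = sliding_windows(frames, window_len=4, stride=2)
print(windows)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```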

In some embodiments, the machine learning system 1510 comprises at least one SVM model, although it is to be appreciated that other machine learning arrangements, including convolutional neural networks, deep neural networks, and other types of neural networks, can additionally or alternatively be used.

In some embodiments, wrist orientation data is obtained from the IMU 1505 and processed as described below.

Generating 3D arm pose estimates from the obtained data in such an arrangement illustratively comprises, for each of a plurality of body directions, calculating wrist orientation relative to a torso coordinate system using wrist orientation relative to an earth coordinate system and the body direction, looking up the calculated wrist orientation relative to the torso coordinate system in a weighted dictionary generated using actual arm poses to determine a weighted point cloud, and utilizing the weighted point cloud to assign a probability to the body direction based at least in part on weights of the weighted point cloud.

This process illustratively further comprises selecting a particular one of the body directions based at least in part on their respective assigned probabilities, utilizing the selected body direction to transform the wrist orientation relative to the earth coordinate system to determine a derived wrist orientation relative to the torso coordinate system, looking up the derived wrist orientation relative to the torso coordinate system in the weighted dictionary to determine a weighted point cloud, and determining a given one of the 3D arm pose estimates based at least in part on weights of the weighted point cloud.
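The two-stage process described above can be sketched as follows under several simplifying assumptions that are not taken from the described system: the body direction is modeled as a yaw angle about the vertical axis, orientations are 3x3 rotation matrices, the dictionary key is a coarse quantization of the wrist x-axis (`quantize` is hypothetical), the probability of a body direction is the normalized sum of the weights in its point cloud, and the pose is the weighted average of the point cloud. The toy dictionary contents are invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Entry = Tuple[float, Vec3, Vec3]  # (weight, elbow position, wrist position)
Key = Tuple[int, int, int]


def yaw_matrix(psi: float) -> Mat3:
    """Rotation by psi radians about the vertical (z) axis."""
    c, s = math.cos(psi), math.sin(psi)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def wrist_in_torso(r_wrist_earth: Mat3, psi: float) -> Mat3:
    """R_torso = Rz(psi)^T * R_earth for a candidate body direction psi."""
    r_t = tuple(zip(*yaw_matrix(psi)))  # transpose equals inverse here
    return tuple(
        tuple(sum(r_t[i][k] * r_wrist_earth[k][j] for k in range(3))
              for j in range(3))
        for i in range(3))


def quantize(r: Mat3, step: float = 0.5) -> Key:
    """Hypothetical dictionary key: wrist x-axis rounded to a coarse grid."""
    return tuple(round(r[i][0] / step) for i in range(3))


def lookup(dictionary: Dict[Key, List[Entry]], r_wrist_earth: Mat3,
           psi: float) -> List[Entry]:
    """Return the weighted point cloud for the torso-relative orientation."""
    return dictionary.get(quantize(wrist_in_torso(r_wrist_earth, psi)), [])


def estimate_pose(r_wrist_earth: Mat3, candidates: Sequence[float],
                  dictionary: Dict[Key, List[Entry]]):
    """Score each candidate body direction, select the most probable one,
    and average its weighted point cloud into a pose estimate."""
    scores = [sum(w for w, _, _ in lookup(dictionary, r_wrist_earth, psi))
              for psi in candidates]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("no candidate body direction matched the dictionary")
    probs = [s / total for s in scores]
    best = candidates[probs.index(max(probs))]
    cloud = lookup(dictionary, r_wrist_earth, best)
    wsum = sum(w for w, _, _ in cloud)
    elbow = tuple(sum(w * e[i] for w, e, _ in cloud) / wsum for i in range(3))
    wrist = tuple(sum(w * p[i] for w, _, p in cloud) / wsum for i in range(3))
    return best, probs, elbow, wrist


# Toy dictionary (invented values): torso-frame positions in meters.
toy_dictionary = {
    (2, 0, 0): [(0.75, (0.0, -0.3, 0.0), (0.3, -0.3, 0.0)),
                (0.25, (0.0, -0.3, 0.1), (0.3, -0.3, 0.1))],
    (0, 2, 0): [(0.2, (0.0, -0.3, 0.0), (0.0, 0.0, 0.0))],
}
r_wrist_earth = yaw_matrix(math.pi / 2)
best, probs, elbow, wrist = estimate_pose(
    r_wrist_earth, [0.0, math.pi / 2], toy_dictionary)
print(best, probs, elbow, wrist)
```

In this toy case the candidate direction of 90 degrees receives the larger total weight, so it is selected and the pose is the weighted average of its two dictionary entries.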

In some embodiments, applying the generated 3D arm pose estimates to the machine learning system comprises extracting possible intake gestures of the generated 3D arm pose estimates into respective segments, resampling each of at least a subset of the extracted segments, and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture.
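The resampling step can be sketched as follows, so that extracted segments of different durations yield inputs of the same length for the SVM model. The use of linear interpolation, the function name `resample_segment` and the target length are assumptions of this sketch, as the resampling method is not specified above.

```python
import math
from typing import List, Sequence


def resample_segment(frames: Sequence[Sequence[float]],
                     target_len: int) -> List[List[float]]:
    """Linearly interpolate a variable-length sequence of per-frame feature
    vectors to exactly `target_len` frames."""
    n = len(frames)
    if n == 0 or target_len < 1:
        raise ValueError("need at least one frame and target_len >= 1")
    if n == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional source index
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([a + frac * (b - a) for a, b in zip(frames[lo], frames[hi])])
    return out


# A 3-frame segment with 2 features per frame, resampled to 5 frames.
segment = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
resampled = resample_segment(segment, target_len=5)
print(resampled)
# [[0.0, 10.0], [1.0, 15.0], [2.0, 20.0], [3.0, 25.0], [4.0, 30.0]]

# Flatten into a single fixed-length vector suitable as classifier input.
flat = [value for frame in resampled for value in frame]
```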

Other types and arrangements of machine learning models having different configurations and functionality can be used in implementing the machine learning system 1510 in other embodiments.

In some embodiments, generating at least one control signal based at least in part on the classification output further comprises at least one of generating at least one guidance signal provided to at least one user device, such as a smartphone or wearable device, generating at least a portion of at least one output display for presentation on at least one user device, and generating at least one alert for delivery to at least one user device.

As indicated above, control signals in these and other arrangements can be used to guide the user toward one or more target activities, or to provide additional or alternative functions, via one or more user devices.

As a more detailed example, the control signal may be configured to trigger at least one of a smartphone, a wearable device and a smart home assistant, or other type of user device or combination of multiple user devices, to provide particular output to guide the user toward the one or more target activities.

The above-described activity detection and remediation process implemented in system 1500 can be similarly applied for each of a plurality of different users of the system 1500.

It is to be appreciated that the term “machine learning system” as used herein is intended to be broadly construed to encompass at least one machine learning algorithm configured for at least one of activity detection and remediation using one or more neural networks. The processing platform 1502 may therefore be viewed as an example of a “machine learning system” as that term is broadly used herein. Examples of particular implementations of machine learning algorithms implemented by machine learning system 1510 in illustrative embodiments are described in detail elsewhere herein.

The component controller 1512 generates one or more control signals for adjusting, triggering or otherwise controlling various operating parameters associated with the controlled system components 1506 based at least in part on outputs generated by the machine learning system 1510. A wide variety of different types of devices or other components can be controlled by component controller 1512, possibly by applying control signals or other signals or information thereto, including additional or alternative components that are part of the same processing device or set of processing devices that implement the processing platform 1502 and/or the IMU 1505.

Such control signals, and additionally or alternatively other types of signals and/or information, can be communicated over one or more networks to other processing devices, such as user terminals or other user devices associated with respective system users.

In some embodiments, the component controller 1512 can be implemented within the machine learning system 1510, rather than as a separate element of processing platform 1502 as shown in the figure.

The processing platform 1502 is configured to utilize a detection and remediation database 1514. Such a database illustratively stores user data, user profiles and a wide variety of other types of information, including data from the IMU 1505, that may be utilized by the machine learning system 1510 in performing activity detection and associated remediation operations. The detection and remediation database 1514 is also configured to store related information, including various processing results, such as classifications or other outputs generated by the machine learning system 1510. Although shown as separate from the processing platform 1502 in this embodiment, it can be implemented at least in part internally to the processing platform 1502 in other embodiments, illustratively using memory 1522 and/or one or more other storage devices of the processing platform 1502.

The component controller 1512 utilizes outputs generated by the machine learning system 1510 to control one or more of the controlled system components 1506. The controlled system components 1506 in some embodiments therefore comprise system components that are driven at least in part by outputs generated by the machine learning system 1510. For example, a controlled component can comprise a processing device such as a computer or smartphone that presents a display to a user and/or directs a user to adjust his or her behavior in a particular manner responsive to an output of a machine learning system. These and numerous other different types of controlled system components 1506 can make use of outputs generated by the machine learning system 1510, including various types of equipment and other systems associated with one or more of the example use cases described elsewhere herein.

Although the machine learning system 1510 and the component controller 1512 are both shown as being implemented on processing platform 1502 in the present embodiment, this is by way of illustrative example only. In other embodiments, the machine learning system 1510 and the component controller 1512 can each be implemented on a separate processing platform. A given such processing platform is assumed to include at least one processing device comprising a processor coupled to a memory.

Examples of such processing devices include computers, servers or other processing devices arranged to communicate over a network. Storage devices such as storage arrays or cloud-based storage systems used for implementation of detection and remediation database 1514 are also considered “processing devices” as that term is broadly used herein.

The network can comprise, for example, a global computer network such as the Internet, a wide area network (WAN), a local area network (LAN), a satellite network, a telephone or cable network, a cellular network such as a 3G, 4G or 5G network, a wireless network implemented using a wireless protocol such as Bluetooth, WiFi or WiMAX, or various portions or combinations of these and other types of communication networks.

It is also possible that at least portions of other system elements such as one or more of the IMU 1505 and/or the controlled system components 1506 can be implemented as part of the processing platform 1502, although shown as being separate from the processing platform 1502 in the figure.

For example, in some embodiments, the system 1500 can comprise a smartphone, wearable device or other user device, as well as combinations of multiple such processing devices, configured to incorporate at least one IMU or other type of sensor and to execute a machine learning system for controlling at least one system component.

In some embodiments, outputs of the machine learning system 1510 are utilized to automatically populate an eating journal or other activity journal in an application running on a smartphone, wearable device or other user device. Such arrangements overcome the significant disadvantages of conventional manual journaling approaches described elsewhere herein.
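One way in which detected intake events might be turned into journal entries is sketched below. The function name `group_into_meals`, the entry fields and the `max_gap` threshold of 300 seconds are hypothetical choices made for this sketch, not details of the described system.

```python
from typing import Dict, List, Sequence


def group_into_meals(intake_times: Sequence[float],
                     max_gap: float = 300.0) -> List[Dict[str, float]]:
    """Group detected intake times (seconds) into journal entries; a gap
    longer than `max_gap` starts a new entry (hypothetical rule)."""
    meals: List[List[float]] = []
    for t in sorted(intake_times):
        if meals and t - meals[-1][-1] <= max_gap:
            meals[-1].append(t)
        else:
            meals.append([t])
    return [{"start": m[0], "end": m[-1], "intakes": len(m)} for m in meals]


journal = group_into_meals([10.0, 40.0, 90.0, 1000.0, 1030.0])
print(journal)
# [{'start': 10.0, 'end': 90.0, 'intakes': 3},
#  {'start': 1000.0, 'end': 1030.0, 'intakes': 2}]
```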

Some embodiments are configured to implement various types of automated remedial actions, driven by outputs of the machine learning system 1510. Examples of automated remedial actions that may be taken in the processing platform 1502 responsive to outputs generated by the machine learning system 1510 include generating in the component controller 1512 at least one control signal for controlling at least one of the controlled system components 1506 over a network, generating at least a portion of at least one output display for presentation on at least one user terminal, generating an alert for delivery to at least one user terminal over a network, and/or storing the outputs in the detection and remediation database 1514.

A wide variety of additional or alternative automated remedial actions may be taken in other embodiments. The particular automated remedial action or actions will tend to vary depending upon the particular use case in which the system 1500 is deployed.

For example, some embodiments implement machine learning based activity detection and remediation algorithms to at least partially automate various aspects of patient care in healthcare applications such as telemedicine. Such applications illustratively involve a wide variety of different types of remote medical monitoring and intervention.

An example of an automated remedial action in this particular context includes generating at least one output signal, such as a feedback signal indicating that the user is eating too fast or otherwise deviating from desired target eating behavior, for presentation on a smartphone, wearable device or other type of user device, in order to guide the user to modify his or her eating behavior.

The feedback signal in some embodiments can be used to automatically guide the user to a desired target eating rate. Additionally or alternatively, one or more such feedback signals can be used to control other aspects of eating activity, such as when a user eats, what a user eats and how much a user eats. Other types of user behaviors can be similarly guided using outputs of machine learning system 1510 in other embodiments.
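A feedback signal of this kind could be derived from detected intake times as sketched below. The function name `eating_rate_feedback`, the signal value `"slow_down"`, the 60-second window and the limit of 6 intakes per window are all hypothetical values chosen for illustration; an actual target eating rate would depend on the particular use case.

```python
from typing import Optional, Sequence


def eating_rate_feedback(intake_times: Sequence[float],
                         window: float = 60.0,
                         max_intakes: int = 6) -> Optional[str]:
    """Return a feedback signal if the number of intakes detected within
    the most recent `window` seconds exceeds `max_intakes`."""
    if not intake_times:
        return None
    latest = max(intake_times)
    recent = [t for t in intake_times if latest - t < window]
    if len(recent) > max_intakes:
        return "slow_down"
    return None


fast = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
slow = [0.0, 30.0, 60.0, 90.0]
print(eating_rate_feedback(fast))  # slow_down
print(eating_rate_feedback(slow))  # None
```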

It is to be appreciated that the term “automated remedial action” as used herein is intended to be broadly construed, so as to encompass the above-described automated remedial actions, as well as numerous other actions that are automatically driven based at least in part on outputs of a machine learning based activity detection algorithm as disclosed herein, with such actions being configured to address or otherwise remediate various detected conditions.

The processing platform 1502 in the present embodiment further comprises a processor 1520, a memory 1522 and a network interface 1524. The processor 1520 is assumed to be operatively coupled to the memory 1522 and to the network interface 1524 as illustrated by the interconnections shown in the figure.

The processor 1520 may comprise, for example, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), an arithmetic logic unit (ALU), a digital signal processor (DSP), or other similar processing device component, as well as other types and arrangements of processing circuitry, in any combination. At least a portion of the functionality of at least one machine learning system and its associated activity detection and remediation algorithms provided by one or more processing devices as disclosed herein can be implemented using such circuitry.

In some embodiments, the processor 1520 comprises one or more graphics processor integrated circuits. Such graphics processor integrated circuits are illustratively implemented in the form of one or more GPUs. Accordingly, in some embodiments, system 1500 is configured to include a GPU-based processing platform. Such a GPU-based processing platform can be a cloud-based platform configured to implement one or more machine learning systems for processing data associated with a large number of system users. Similar arrangements can be implemented using TPUs and/or other processing devices.

Numerous other arrangements are possible. For example, in some embodiments, a machine learning system can be implemented on a single processor-based device, such as a smartphone, smartwatch, fitness device, wearable device, or other type of user device, utilizing one or more processors of that device. Such embodiments are also referred to herein as “on-device” implementations of machine learning systems.

Also, the machine learning system 1510 can be distributed across multiple processing devices, for example, with a full version of the machine learning system 1510 being implemented in a cloud-based processing system and a less computationally intensive or “lightweight” version of the machine learning system 1510 being downloaded to a wearable device or other user device from the cloud over a network.

The memory 1522 stores software program code for execution by the processor 1520 in implementing portions of the functionality of the processing platform 1502. For example, at least portions of the functionality of machine learning system 1510 and component controller 1512 can be implemented using program code stored in memory 1522.

A given such memory that stores such program code for execution by a corresponding processor is an example of what is more generally referred to herein as a processor-readable storage medium having program code embodied therein, and may comprise, for example, electronic memory such as SRAM, DRAM or other types of random access memory, flash memory, read-only memory (ROM), magnetic memory, optical memory, or other types of storage devices in any combination.

Articles of manufacture comprising such processor-readable storage media are considered embodiments of the invention. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals.

Other types of computer program products comprising processor-readable storage media can be implemented in other embodiments.

In addition, illustrative embodiments may be implemented in the form of integrated circuits comprising processing circuitry configured to implement processing operations associated with one or both of the machine learning system 1510 and the component controller 1512 as well as other related functionality. For example, at least a portion of the machine learning system 1510 is illustratively implemented in at least one neural network integrated circuit of a processing device of the processing platform 1502.

The network interface 1524 is configured to allow the processing platform 1502 to communicate over one or more networks with other system elements, and may comprise one or more conventional transceivers.

It is to be appreciated that the particular arrangement of components and other system elements shown in FIG. 15 is presented by way of illustrative example only, and numerous alternative embodiments are possible. For example, other embodiments of information processing systems can be configured to implement machine learning system functionality of the type disclosed herein.

Terms such as “data source” and “controlled system component” as used herein are intended to be broadly construed. For example, a given set of data sources in some embodiments can comprise a smartphone of a user, one or more wearable devices of the user, one or more smart home assistants associated with the user, and/or one or more sensors associated with the user. Additionally or alternatively, data sources can comprise video cameras, sensor arrays or other types of imaging devices or data capture devices.

It is to be appreciated that references above and elsewhere herein to terms such as “cameras” or “sensors” are intended to be broadly construed, and that numerous other types of imaging devices and/or data capture devices, in any combination, can be used in addition to or in place of cameras or sensors.

Other examples of “data sources” as that term is broadly used herein include various types of databases or other storage systems accessible over a network. A wide variety of different types of data sources can be used to provide input data to a machine learning system in illustrative embodiments. Such data sources can be used in addition to or in place of IMU 1505 in other embodiments. The IMU 1505 is considered a type of “data source” as that term is broadly used herein.

A given controlled component can illustratively comprise a computer, smartphone, wearable device or other type of processing device that receives an output from a machine learning system and performs at least one automated remedial action in response thereto.

Illustrative embodiments of the system 1500 can be configured, for example, to detect eating behavior of a user and to associate deviations from optimal eating behavior with particular behavior-related conditions.

The system 1500 can be configured to support a wide variety of distinct applications, in numerous diverse contexts.

For example, with automated assistance from the system 1500, users can be provided with intervention or other treatment in an automated manner based upon activity detection, before health conditions deteriorate.

In some embodiments, the system 1500 can generate personalized intervention suggestions based upon the activity detection.

In some embodiments, when the system 1500 detects any significant deviations of eating activities or other user activities from optimal patterns, the system 1500 will “nudge” or otherwise guide a user to adjust his or her behavior in order for the deviated patterns to get back on track. It is to be appreciated that numerous additional or alternative remedial actions can be triggered by machine learning based activity detection as disclosed herein.

It is to be appreciated that the particular arrangements described above are examples only, intended to demonstrate utility of illustrative embodiments, and should not be viewed as limiting in any way.

Automated remedial actions taken based on outputs generated by a machine learning system of the type disclosed herein can include particular actions involving interaction between a processing platform implementing the machine learning system and other related equipment utilized in one or more of the use cases described above. For example, outputs generated by a machine learning system can control one or more components of a related system. In some embodiments, the machine learning system and the related equipment are implemented on the same processing platform, which may comprise a computer, a smartphone, a wearable device or other type of processing device.

In one embodiment, a method is provided that includes obtaining data from an IMU of a user, generating 3D arm pose estimates from the obtained data, applying the generated 3D arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities, obtaining at least one classification output from the machine learning system, and generating at least one control signal based at least in part on the classification output, possibly to guide the user toward one or more target activities, or to control performance of additional or alternative functions. The method is performed by at least one processing device comprising a processor coupled to a memory.

In some embodiments, the IMU is part of a wearable device of the user, such as a smartwatch, wrist-mounted fitness device or the like.

Additionally or alternatively, in some embodiments, the at least one processing device comprises a wearable device of the user.

In some embodiments, the at least one processing device comprises a smartphone of the user that is in communication with a wearable device of the user, with the wearable device comprising the IMU.

In some embodiments, the one or more designated activities comprise eating activities and the one or more target activities comprise one or more activities associated with a particular desired health condition of the user.

In some embodiments, the machine learning system is trained on one or more sets of 3D arm pose estimate training data using supervised learning.

A given one of the 3D arm pose estimates in some embodiments comprises a combination of an estimate of a wrist orientation relative to a body and an estimate of an elbow orientation relative to the body. The elbow orientation relative to the body is illustratively determined at least in part based on the wrist orientation relative to the body. The wrist orientation relative to the body is illustratively determined based at least in part on an estimate of a direction of the body and an estimate of a wrist orientation relative to a specified frame of reference.
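By way of illustrative example only, the determination of a wrist orientation relative to the body from an estimated body direction and a wrist orientation relative to a specified frame of reference can be sketched as a composition of rotations. The sketch below assumes, purely for illustration, that the frame of reference is an earth-fixed frame, that the body direction is a single yaw angle about the vertical axis, and that orientations are 3x3 rotation matrices; the function names are hypothetical and are not taken from this disclosure.

```python
import math

def yaw_matrix(theta):
    """Rotation about the vertical (z) axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    """Transpose of a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def wrist_relative_to_body(r_earth_wrist, body_yaw):
    """Assumed model: R_body_wrist = R_earth_body^T * R_earth_wrist,
    where R_earth_body is a pure yaw rotation given by the body direction."""
    r_earth_body = yaw_matrix(body_yaw)
    return matmul(transpose(r_earth_body), r_earth_wrist)
```

Under these assumptions, a call such as `wrist_relative_to_body(yaw_matrix(0.5), 0.5)` yields approximately the identity matrix, reflecting a wrist that points in the same direction as the body.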

In some embodiments, generating 3D arm pose estimates comprises generating a multivariate time series of 3D arm pose estimates comprising a plurality of data segments associated with respective frames of a specified sliding time window.
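As a minimal sketch of such a sliding time window, and assuming for illustration that each frame of the multivariate time series is a flat list of numeric pose values, the data segments can be produced as follows. The window length and step size are hypothetical parameters not specified in this disclosure.

```python
def sliding_windows(series, window, step):
    """Split a multivariate time series (a list of per-frame value lists)
    into overlapping data segments of `window` frames, advancing by `step`."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(series) - window
    # A series shorter than one window produces no segments.
    return [series[start:start + window]
            for start in range(0, last_start + 1, step)]
```

With a step smaller than the window length, consecutive segments overlap, so a gesture that straddles a segment boundary still appears whole in at least one segment when the gesture is no longer than the overlap.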

In some embodiments, the machine learning system comprises an SVM classifier, although other types of classifiers can be used, such as a convolutional neural network classifier, a deep neural network classifier, or another type of classifier comprising at least one neural network, as well as numerous other arrangements of machine learning models or algorithms.
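Because an SVM classifier typically operates on input vectors of a fixed dimension, extracted segments of differing duration are resampled before classification in some embodiments. The following sketch illustrates one possible resampling, assuming linear interpolation between frames; the interpolation method, the function names and the flattening of frames into a single feature vector are assumptions made for illustration only.

```python
def resample_segment(segment, target_len):
    """Linearly interpolate a variable-length segment (a list of frames,
    each a list of floats) to exactly `target_len` frames."""
    n = len(segment)
    if n == 0 or target_len <= 0:
        raise ValueError("segment must be non-empty and target_len positive")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(segment[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(segment[lo], segment[hi])])
    return out

def to_feature_vector(segment):
    """Flatten a resampled segment into one fixed-length feature vector."""
    return [value for frame in segment for value in frame]
```

After resampling, every segment yields a feature vector of the same length, namely the target number of frames multiplied by the number of values per frame, which can then be supplied to the classifier.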

In some embodiments, generating at least one control signal based at least in part on the classification output further comprises at least one of generating at least one guidance signal provided to at least one user device, generating at least a portion of at least one output display for presentation on at least one user device, and generating at least one alert for delivery to at least one user device. Such arrangements may be used, for example, to guide the user toward one or more target activities, such as a desired eating activity pattern, or to control performance of additional or alternative functions.
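A simplified sketch of generating control signals from classification outputs is shown below. It assumes, for illustration only, that the classification outputs are per-segment Boolean intake-gesture decisions and that an alert is generated when the number of detected gestures exceeds a configured limit; the `ControlSignal` type, the signal kinds and the limit parameter are hypothetical and are not specified in this disclosure.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ControlSignal:
    kind: str      # hypothetical signal kind, e.g. "display" or "alert"
    payload: str   # human-readable content for the user device

def generate_control_signals(is_intake: List[bool],
                             max_gestures: int) -> List[ControlSignal]:
    """Map per-segment classification outputs to control signals."""
    count = sum(is_intake)
    # Always refresh the output display with the current gesture count.
    signals = [ControlSignal("display", f"intake gestures detected: {count}")]
    if count > max_gestures:
        # Assumed policy: alert once the count exceeds the configured limit.
        signals.append(ControlSignal(
            "alert", f"{count} intake gestures exceed the limit of {max_gestures}"))
    return signals
```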

In another embodiment, a system comprises at least one processing device comprising a processor coupled to a memory, with the processing device being configured to obtain data from an inertial measurement unit of a user, to generate 3D arm pose estimates from the obtained data, to apply the generated 3D arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities, to obtain at least one classification output from the machine learning system, and to generate at least one control signal based at least in part on the classification output. For example, in some embodiments, the control signal can be configured to guide the user toward one or more target activities, or to control performance of additional or alternative functions.

In a further embodiment, a computer program product comprises a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code, when executed by at least one processing device comprising a processor coupled to a memory, causes the processing device to obtain data from an inertial measurement unit of a user, to generate 3D arm pose estimates from the obtained data, to apply the generated 3D arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities, to obtain at least one classification output from the machine learning system, and to generate at least one control signal based at least in part on the classification output. For example, in some embodiments, the control signal can be configured to guide the user toward one or more target activities, or to control performance of additional or alternative functions.

As indicated previously, additional details regarding these and other illustrative embodiments are described elsewhere herein. The information processing systems, processing platforms, processing devices, networks, machine learning systems and other systems and components disclosed herein in conjunction with FIG. 15 and other embodiments are illustratively configured in some embodiments to implement the example algorithms and other techniques that are described herein.

It should also be understood that the particular arrangements shown and described in conjunction with FIG. 15 are presented by way of illustrative example only, and numerous alternative embodiments are possible. The various embodiments disclosed herein should therefore not be construed as limiting in any way. Numerous alternative arrangements of machine learning based activity detection can be utilized in other embodiments. Those skilled in the art will also recognize that alternative processing operations and associated system entity configurations can be used in other embodiments.

It is therefore possible that other embodiments may include additional or alternative system elements, relative to the entities of the illustrative embodiments. Accordingly, the particular system configurations and associated algorithm implementations can be varied in other embodiments.

A given processing device or other component of an information processing system as described herein is illustratively configured utilizing a corresponding processing device comprising a processor coupled to a memory. The processor executes software program code stored in the memory in order to control the performance of processing operations and other functionality. The processing device also comprises a network interface that supports communication over one or more networks.

The processor may comprise, for example, a microprocessor, an ASIC, an FPGA, a CPU, a GPU, a TPU, an ALU, a DSP, or other similar processing device component, as well as other types and arrangements of processing circuitry, in any combination. For example, at least a portion of the functionality of at least one machine learning system, and its machine learning algorithms for activity detection and remediation, provided by one or more processing devices as disclosed herein, can be implemented using such circuitry.

The memory stores software program code for execution by the processor in implementing portions of the functionality of the processing device. A given such memory that stores such program code for execution by a corresponding processor is an example of what is more generally referred to herein as a processor-readable storage medium having program code embodied therein, and may comprise, for example, electronic memory such as SRAM, DRAM or other types of random access memory, ROM, flash memory, magnetic memory, optical memory, or other types of storage devices in any combination.

As mentioned previously, articles of manufacture comprising such processor-readable storage media are considered embodiments of the invention. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Other types of computer program products comprising processor-readable storage media can be implemented in other embodiments.

In addition, embodiments of the invention may be implemented in the form of integrated circuits comprising processing circuitry configured to implement processing operations associated with implementation of a machine learning system.

An information processing system as disclosed herein may be implemented using one or more processing platforms, or portions thereof.

For example, one illustrative embodiment of a processing platform that may be used to implement at least a portion of an information processing system comprises cloud infrastructure including virtual machines implemented using a hypervisor that runs on physical infrastructure. Such virtual machines may comprise respective processing devices that communicate with one another over one or more networks.

The cloud infrastructure in such an embodiment may further comprise one or more sets of applications running on respective ones of the virtual machines under the control of the hypervisor. It is also possible to use multiple hypervisors each providing a set of virtual machines using at least one underlying physical machine. Different sets of virtual machines provided by one or more hypervisors may be utilized in configuring multiple instances of various components of the information processing system.

Another illustrative embodiment of a processing platform that may be used to implement at least a portion of an information processing system as disclosed herein comprises a plurality of processing devices which communicate with one another over at least one network. Each processing device of the processing platform is assumed to comprise a processor coupled to a memory. A given such network can illustratively include, for example, a global computer network such as the Internet, a WAN, a LAN, a satellite network, a telephone or cable network, a cellular network such as a 3G, 4G or 5G network, a wireless network implemented using a wireless protocol such as Bluetooth, WiFi or WiMAX, or various portions or combinations of these and other types of communication networks.

Again, these particular processing platforms are presented by way of example only, and an information processing system may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, servers, storage devices or other processing devices.

A given processing platform implementing a machine learning system as disclosed herein can alternatively comprise a single processing device, such as a computer, a smartphone or a wearable device, that implements not only the machine learning system but also at least one data source and one or more controlled components. It is also possible in some embodiments that one or more such system elements can run on or be otherwise supported by cloud infrastructure or other types of virtualization infrastructure.

It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.

Also, numerous other arrangements of computers, servers, storage devices or other components are possible in an information processing system. Such components can communicate with other elements of the information processing system over any type of network or other communication media.

As indicated previously, components of the system as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, certain functionality disclosed herein can be implemented at least in part in the form of software.

The particular configurations of information processing systems described herein are exemplary only, and a given such system in other embodiments may include other elements in addition to or in place of those specifically shown, including one or more elements of a type commonly found in a conventional implementation of such a system.

For example, in some embodiments, an information processing system may be configured to utilize the disclosed techniques to provide additional or alternative functionality in other contexts.

The processes described herein are illustrated as a collection of steps in a logical flow, which represent a sequence of operations that can be implemented in hardware, software, or a combination thereof. In the context of software, the steps represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the processes. Moreover, some of the operations can be repeated during the process.

The disclosure sets forth example embodiments and, as such, is not intended to limit the scope of embodiments of the disclosure and the appended claims in any way. Embodiments have been described above with the aid of functional building blocks illustrating the implementation of specified components, functions, and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined to the extent that the specified functions and relationships thereof are appropriately performed.

The foregoing description of specific embodiments will so fully reveal the general nature of embodiments of the disclosure that others can, by applying knowledge of those of ordinary skill in the art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of embodiments of the disclosure. Therefore, such adaptation and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. The phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the specification is to be interpreted by persons of ordinary skill in the relevant art in light of the teachings and guidance presented herein.

The breadth and scope of embodiments of the disclosure should not be limited by any of the above-described example embodiments but should be defined only in accordance with the following claims and their equivalents.

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain implementations could include, while other implementations do not include, certain features, elements, and/or operations. Thus, such conditional language generally is not intended to imply that features, elements, and/or operations are in any way required for one or more implementations or that one or more implementations necessarily include logic for deciding, with or without user input or prompting, whether these features, elements, and/or operations are included or are to be performed in any particular implementation.

A person of ordinary skill in the art will recognize that any process or method disclosed herein can be modified in many ways. The process parameters and sequence of the steps described and/or illustrated herein are given by way of example only and can be varied as desired. For example, while the steps illustrated and/or described herein may be shown or discussed in a particular order, these steps do not necessarily need to be performed in the order illustrated or discussed.

The various exemplary methods described and/or illustrated herein may also omit one or more of the steps described or illustrated herein or comprise additional steps in addition to those disclosed. Further, a step of any method as disclosed herein can be combined with any one or more steps of any other method as disclosed herein.

It is, of course, not possible to describe every conceivable combination of elements and/or methods for purposes of describing the various features of the disclosure, but those of ordinary skill in the art recognize that many further combinations and permutations of the disclosed features are possible. Accordingly, various modifications may be made to the disclosure without departing from the scope or spirit thereof. Further, other embodiments of the disclosure may be apparent from consideration of the specification and annexed drawings, and practice of disclosed embodiments as presented herein. Examples put forward in the specification and annexed drawings should be considered, in all respects, as illustrative and not restrictive. Although specific terms are employed herein, they are used in a generic and descriptive sense only, and not used for purposes of limitation.

Unless otherwise noted, the terms “connected to” and “coupled to” (and their derivatives), as used in the specification, are to be construed as permitting both direct and indirect (i.e., via other elements or components) connection. In addition, the terms “a” or “an,” as used in the specification, are to be construed as meaning “at least one of.” Finally, for ease of use, the terms “including” and “having” (and their derivatives), as used in the specification, are interchangeable with and have the same meaning as the word “comprising.”

From the foregoing, and the accompanying drawings, it will be appreciated that, although specific implementations have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the appended claims and the elements recited therein. In addition, while certain aspects are presented below in certain claim forms, the inventors contemplate the various aspects in any available claim form. For example, while only some aspects may currently be recited as being embodied in a particular configuration, other aspects may likewise be so embodied. Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended to embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.

It should again be emphasized that the embodiments of the invention as described herein are intended to be illustrative only. Other embodiments of the invention can be implemented utilizing a wide variety of different types and arrangements of information processing systems, networks and processing devices than those utilized in the particular illustrative embodiments described herein, and in numerous alternative processing contexts. In addition, the particular assumptions made herein in the context of describing certain embodiments need not apply in other embodiments. These and numerous other alternative embodiments will be readily apparent to those skilled in the art. 

What is claimed is:
 1. A method comprising: obtaining data from an inertial measurement unit of a user; generating three-dimensional arm pose estimates from the obtained data; applying the generated three-dimensional arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities; and obtaining at least one classification output from the machine learning system; wherein the method is performed by at least one processing device comprising a processor coupled to a memory.
 2. The method of claim 1 wherein the inertial measurement unit is part of a wearable device of the user.
 3. The method of claim 1 wherein the at least one processing device comprises a wearable device of the user.
 4. The method of claim 1 wherein the at least one processing device comprises a smartphone of the user that is in communication with a wearable device of the user, the wearable device comprising the inertial measurement unit.
 5. The method of claim 1 wherein the one or more designated activities comprise eating activities.
 6. The method of claim 1 wherein the machine learning system is trained on one or more sets of three-dimensional arm pose estimate training data using supervised learning.
 7. The method of claim 1 further comprising generating at least one control signal based at least in part on the classification output to guide the user toward one or more target activities.
 8. The method of claim 7 wherein the one or more target activities comprise one or more activities associated with a particular desired health condition of the user.
 9. The method of claim 1 wherein a given one of the three-dimensional arm pose estimates comprises a combination of an estimate of a wrist orientation relative to a body and an estimate of an elbow orientation relative to the body.
 10. The method of claim 1 wherein obtaining data from an inertial measurement unit of a user comprises obtaining wrist orientation data from the inertial measurement unit.
 11. The method of claim 1 wherein generating three-dimensional arm pose estimates from the obtained data comprises, for each of a plurality of body directions, calculating wrist orientation relative to a torso coordinate system using wrist orientation relative to an earth coordinate system and the body direction, looking up the calculated wrist orientation relative to the torso coordinate system in a weighted dictionary generated using actual arm poses to determine a weighted point cloud, and utilizing the weighted point cloud to assign a probability to the body direction based at least in part on weights of the weighted point cloud.
 12. The method of claim 11 wherein generating three-dimensional arm pose estimates from the obtained data further comprises: selecting a particular one of the body directions based at least in part on their respective assigned probabilities; and utilizing the selected body direction to transform the wrist orientation relative to the earth coordinate system to determine a derived wrist orientation relative to the torso coordinate system.
 13. The method of claim 12 wherein generating three-dimensional arm pose estimates from the obtained data further comprises: looking up the derived wrist orientation relative to the torso coordinate system in the weighted dictionary to determine a weighted point cloud; and determining a given one of the three-dimensional arm pose estimates based at least in part on weights of the weighted point cloud.
 14. The method of claim 1 wherein the machine learning system comprises at least one support vector machine (SVM) model, and wherein applying the generated three-dimensional arm pose estimates to the machine learning system comprises: extracting possible intake gestures of the generated three-dimensional arm pose estimates into respective segments; resampling each of at least a subset of the extracted segments; and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture.
 15. A system comprising: at least one processing device comprising a processor coupled to a memory; the processing device being configured: to obtain data from an inertial measurement unit of a user; to generate three-dimensional arm pose estimates from the obtained data; to apply the generated three-dimensional arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities; and to obtain at least one classification output from the machine learning system.
 16. The system of claim 15 wherein generating three-dimensional arm pose estimates from the obtained data comprises: for each of a plurality of body directions, calculating wrist orientation relative to a torso coordinate system using wrist orientation relative to an earth coordinate system and the body direction, looking up the calculated wrist orientation relative to the torso coordinate system in a weighted dictionary generated using actual arm poses to determine a weighted point cloud, and utilizing the weighted point cloud to assign a probability to the body direction based at least in part on weights of the weighted point cloud; selecting a particular one of the body directions based at least in part on their respective assigned probabilities; utilizing the selected body direction to transform the wrist orientation relative to the earth coordinate system to determine a derived wrist orientation relative to the torso coordinate system; looking up the derived wrist orientation relative to the torso coordinate system in the weighted dictionary to determine a weighted point cloud; and determining a given one of the three-dimensional arm pose estimates based at least in part on weights of the weighted point cloud.
 17. The system of claim 15 wherein the machine learning system comprises at least one support vector machine (SVM) model, and wherein applying the generated three-dimensional arm pose estimates to the machine learning system comprises: extracting possible intake gestures of the generated three-dimensional arm pose estimates into respective segments; resampling each of at least a subset of the extracted segments; and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture.
 18. A computer program product comprising a non-transitory processor-readable storage medium having stored therein program code of one or more software programs, wherein the program code, when executed by at least one processing device comprising a processor coupled to a memory, causes the processing device: to obtain data from an inertial measurement unit of a user; to generate three-dimensional arm pose estimates from the obtained data; to apply the generated three-dimensional arm pose estimates to a machine learning system trained to recognize temporal-spatial patterns of one or more designated activities; and to obtain at least one classification output from the machine learning system.
 19. The computer program product of claim 18 wherein generating three-dimensional arm pose estimates from the obtained data comprises: for each of a plurality of body directions, calculating wrist orientation relative to a torso coordinate system using wrist orientation relative to an earth coordinate system and the body direction, looking up the calculated wrist orientation relative to the torso coordinate system in a weighted dictionary generated using actual arm poses to determine a weighted point cloud, and utilizing the weighted point cloud to assign a probability to the body direction based at least in part on weights of the weighted point cloud; selecting a particular one of the body directions based at least in part on their respective assigned probabilities; utilizing the selected body direction to transform the wrist orientation relative to the earth coordinate system to determine a derived wrist orientation relative to the torso coordinate system; looking up the derived wrist orientation relative to the torso coordinate system in the weighted dictionary to determine a weighted point cloud; and determining a given one of the three-dimensional arm pose estimates based at least in part on weights of the weighted point cloud.
 20. The computer program product of claim 18 wherein the machine learning system comprises at least one support vector machine (SVM) model, and wherein applying the generated three-dimensional arm pose estimates to the machine learning system comprises: extracting possible intake gestures of the generated three-dimensional arm pose estimates into respective segments; resampling each of at least a subset of the extracted segments; and utilizing the SVM model to classify whether or not each of one or more of the extracted and resampled segments comprises an intake gesture. 
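
By way of illustration only, and without limiting the claims, the segment extraction and resampling operations recited in claims 14, 17 and 20 can be sketched as follows. The sketch assumes that each three-dimensional arm pose estimate is reduced to a wrist position (x, y, z), and that a candidate intake gesture is a run of samples in which a wrist height value exceeds a threshold. That extraction criterion, the function names `extract_segments`, `resample_segment` and `to_feature_vector`, and the parameters `threshold`, `min_len` and `n_out` are all hypothetical and are not taken from the disclosure. Linear interpolation is likewise only one possible resampling technique. The SVM classification itself is not shown; the fixed-length vectors produced here are the kind of input such a model could accept.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def extract_segments(heights: Sequence[float], threshold: float,
                     min_len: int = 2) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where height stays above threshold.

    Thresholding on wrist height is an assumed placeholder for whatever
    criterion is actually used to find possible intake gestures.
    """
    segments = []
    start = None
    for i, h in enumerate(heights):
        if h > threshold and start is None:
            start = i  # a candidate segment begins
        elif h <= threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None  # the candidate segment ends
    # Close a segment that runs to the end of the recording.
    if start is not None and len(heights) - start >= min_len:
        segments.append((start, len(heights)))
    return segments


def resample_segment(points: Sequence[Point3D], n_out: int) -> List[Point3D]:
    """Linearly interpolate a variable-length trajectory to n_out samples."""
    if len(points) < 2 or n_out < 2:
        raise ValueError("need at least two input points and two output samples")
    last = len(points) - 1
    out = []
    for k in range(n_out):
        t = k * last / (n_out - 1)   # fractional index into the input
        lo = min(int(t), last - 1)   # left neighbour, clamped at the end
        frac = t - lo
        a, b = points[lo], points[lo + 1]
        out.append(tuple(a[d] + frac * (b[d] - a[d]) for d in range(3)))
    return out


def to_feature_vector(points: Sequence[Point3D]) -> List[float]:
    """Flatten resampled 3D points into one fixed-length vector."""
    return [c for p in points for c in p]
```

Because every segment is resampled to the same number of points, gestures of different durations yield feature vectors of equal length, which is what allows a single classifier to compare them.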